WO2021196934A1 - Question recommendation method and apparatus based on field similarity calculation, and server - Google Patents

Question recommendation method and apparatus based on field similarity calculation, and server

Info

Publication number
WO2021196934A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
fields
similarity
target
string
Prior art date
Application number
PCT/CN2021/078031
Other languages
French (fr)
Chinese (zh)
Inventor
赵亮
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021196934A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Definitions

  • This application belongs to the field of artificial intelligence technology, and in particular relates to a question recommendation method, device, storage medium, and server based on field similarity calculation.
  • An intelligent question answering system based on natural language usually works as follows: the user enters a question sentence; the system performs natural language processing on it and generates a structured query; the system then uses that query to look up the answer in a database or knowledge base; and finally the query result is returned to the user.
  • This application proposes a question recommendation method, device, storage medium and server based on field similarity calculation, which can improve the accuracy of question recommendation in an intelligent question answering system.
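  • An embodiment of the present application provides a question recommendation method based on field similarity calculation, including:
  • obtaining an input first question sentence;
  • performing word segmentation on the first question sentence and extracting the fields it contains;
  • comparing each field with the fields in a pre-built field data table, finding the fields that also appear in the field data table, and determining them as target fields;
  • calculating the similarity between the target field and each other field in the field data table;
  • selecting the field with the highest similarity among the other fields and using it to replace the target field in the first question sentence, obtaining a recommended second question sentence.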
  • an embodiment of the present application provides a question recommendation device based on field similarity calculation, including:
  • a question acquisition module, used to acquire the input first question sentence;
  • a word segmentation module, used to perform word segmentation on the first question sentence and extract the fields it contains;
  • a field comparison module, used to compare each field one by one with the fields in the pre-built field data table, find the fields that also appear in the field data table, and determine them as target fields;
  • a field similarity calculation module, configured to calculate the similarity between the target field and each other field in the field data table;
  • a question recommendation module, configured to select the field with the highest similarity among the other fields, replace the target field in the first question sentence with it, and obtain a recommended second question sentence.
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the steps of the question recommendation method proposed in the first aspect of the embodiments of the present application are implemented.
  • An embodiment of the present application provides a server, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor; when the processor executes the computer program, the steps of the question recommendation method proposed in the first aspect are implemented.
  • An embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the steps of the question recommendation method described in the first aspect.
  • In the question recommendation method based on field similarity calculation proposed in this application, after the fields of the input question sentence are extracted, each field is compared one by one with the fields in the pre-built field data table, and the extracted fields that also appear in the field data table are determined as target fields. The similarity between the target field and each other field in the field data table is then calculated, and the field with the highest similarity replaces the target field in the question sentence to give the recommended question sentence.
  • Because this application takes the similarity between the preset fields into account and replaces the field in the original question sentence with the most similar field, it can generate new question sentences that better meet user expectations and improve the accuracy of the questions recommended by the intelligent question answering system.
  • FIG. 1 is a flowchart of a first embodiment of a question recommendation method provided by an embodiment of the present application;
  • FIG. 2 is a flowchart of a second embodiment of a question recommendation method provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of a third embodiment of a question recommendation method provided by an embodiment of the present application;
  • FIG. 4 is a structural diagram of an embodiment of a question recommendation device provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of a server provided by an embodiment of the present application.
  • This application proposes a question recommendation method, device, storage medium, and server, which can improve the accuracy of the question recommendation by the intelligent question answering system.
  • a first embodiment of a method for recommending a question based on field similarity calculation in an embodiment of the present application includes:
  • The user can enter the question to be asked, that is, the first question sentence, on the terminal device by voice or by hand, and the question sentence is sent to the intelligent question answering system on the server side.
  • After the server obtains the question sentence, it segments the sentence and extracts the fields it contains.
  • Any word segmentation method in the prior art can be used, for example jieba word segmentation. If the user asks "What is the average age of men in different occupations?", jieba word segmentation yields the field list ["male", "different", "occupation", "average", "age", "how", "?"].
  • The server then compares each field one by one with the fields in the pre-built field data table and determines the fields that also appear in the field data table as target fields.
  • the pre-built field data table can be as shown in Table 1 below:
  • These fields can be added to jieba's custom dictionary so that field keywords are not split when the user's question sentence is segmented. For example, by default jieba cuts the field keyword "personal monthly income after tax" into three fields, "personal", "after tax", and "monthly income"; once "personal monthly income after tax" is added to jieba's custom dictionary, jieba no longer splits it.
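The effect of a custom dictionary can be illustrated without jieba itself. The following is a minimal sketch, assuming a toy greedy longest-match segmenter over whitespace-separated English words; the function name `segment` and the sample sentence are hypothetical, and jieba's real segmentation algorithm is different.

```python
def segment(sentence, phrases):
    """Toy segmenter: greedily merge words that form a known dictionary phrase.

    `phrases` plays the role of a custom dictionary (an assumption for
    illustration only; this is not how jieba works internally).
    """
    words = sentence.lower().rstrip("?").split()
    # Longest dictionary phrase, measured in words; 1 if the dictionary is empty.
    max_len = max((len(p.split()) for p in phrases), default=1)
    fields, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then fall back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in phrases:
                fields.append(candidate)
                i += n
                break
    return fields


question = "What is the personal monthly income after tax by occupation?"
print(segment(question, set()))
print(segment(question, {"personal monthly income after tax"}))
```

With an empty dictionary every word becomes its own field; with the phrase registered, "personal monthly income after tax" survives as a single field, which is the behaviour the custom dictionary is meant to provide.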
  • After the target field is determined, the similarity between the target field and each other field in the field data table is calculated. In the example of Table 1 above, for the target field "occupation", this means calculating the similarity between "occupation" and each of "name", "gender", "age", "personal monthly income after tax", and "industry".
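The comparison and pairing steps can be sketched as follows. This is a minimal illustration, assuming the field data table is a plain list of the six field names mentioned in the text; the helper names `find_target_fields` and `comparison_pairs` are hypothetical.

```python
# Field names taken from the example in the text; the list layout is an assumption.
FIELD_TABLE = [
    "name",
    "gender",
    "age",
    "occupation",
    "personal monthly income after tax",
    "industry",
]


def find_target_fields(extracted_fields, field_table):
    """Return the extracted fields that also appear in the field data table."""
    table = set(field_table)
    return [field for field in extracted_fields if field in table]


def comparison_pairs(target_field, field_table):
    """Pair the target field with every other field in the table."""
    return [(target_field, other) for other in field_table if other != target_field]


extracted = ["male", "different", "occupation", "average", "age", "how", "?"]
print(find_target_fields(extracted, FIELD_TABLE))
print(comparison_pairs("occupation", FIELD_TABLE))
```

For the segmented example question, both "occupation" and "age" are found in the table, and "occupation" is paired with the five remaining fields.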
  • The similarity between the target field and any other field in the field data table can be calculated by the following steps:
  • combining the string and enumeration values of the target field with the string and enumeration values of the other field to calculate similarity indices between the two fields, where a similarity index is a parameter used to measure the degree of similarity between two fields;
  • calculating the similarity between the target field and the other field according to these similarity indices.
  • Calculating the similarity indices may include calculating a string similarity index, a string length similarity index, an enumerated-value count similarity index, and an enumerated-value length similarity index between the target field and the other field. All four are important parameters for judging how similar two fields are.
  • The string similarity index s1 is calculated from the following quantities (the formula itself is not reproduced in this text): sim, the number of identical strings in the two fields (that is, the target field and the other field); short, the string length of the shorter of the two fields; long, the string length of the longer of the two fields; and a hyperparameter that controls the impact of the string on the similarity.
  • The string length similarity index s2 is calculated from the following quantities (the formula itself is not reproduced in this text): short, the string length of the shorter of the two fields (that is, the target field and the other field), and long, the string length of the longer of the two fields. The original works this out for the fields "personal monthly income after tax" and "occupation"; the resulting value is not reproduced in this text.
  • The enumerated-value count similarity index s3 is calculated from the following quantities (the formula itself is not reproduced in this text): min, the number of enumeration values of the field that has fewer enumeration values, and max, the number of enumeration values of the field that has more enumeration values.
  • The enumerated-value length similarity index s4 is calculated from the following quantities (the formula itself is not reproduced in this text): avg_min, the average enumeration-value length of the field whose enumeration values are shorter on average, and avg_max, the average enumeration-value length of the field whose enumeration values are longer on average.
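Because the formulas are missing from this text, the following is only a sketch under an explicit assumption: that s2, s3 and s4 are simple smaller-over-larger ratios of the quantities defined above (short/long, min/max, avg_min/avg_max). That form is a guess consistent with the variable definitions, not the patent's confirmed formulas. s1 is omitted because the role of its hyperparameter cannot be recovered. All function names and the sample enumeration values are hypothetical.

```python
def ratio(a, b):
    """Smaller value divided by larger value, in [0, 1]; 1.0 if both are zero."""
    low, high = sorted((a, b))
    return 1.0 if high == 0 else low / high


def average_length(values):
    """Average string length of a list of enumeration values (0.0 if empty)."""
    return sum(len(v) for v in values) / len(values) if values else 0.0


def string_length_index(field_a, field_b):
    """Assumed form of s2: short / long."""
    return ratio(len(field_a), len(field_b))


def enum_count_index(enums_a, enums_b):
    """Assumed form of s3: min / max of the enumeration-value counts."""
    return ratio(len(enums_a), len(enums_b))


def enum_length_index(enums_a, enums_b):
    """Assumed form of s4: avg_min / avg_max of the enumeration-value lengths."""
    return ratio(average_length(enums_a), average_length(enums_b))


# Hypothetical enumeration values, for illustration only.
gender_values = ["male", "female"]
occupation_values = ["teacher", "doctor"]
print(string_length_index("occupation", "industry"))
print(enum_count_index(gender_values, occupation_values))
print(enum_length_index(gender_values, occupation_values))
```

Under this assumption every index lies between 0 and 1, with 1 meaning the two fields are identical on that measure, which makes the indices straightforward to average.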
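  • Calculating the similarity between the target field and the other field according to the similarity indices may include: taking the average or weighted average of the string similarity index, the string length similarity index, the enumerated-value count similarity index, and the enumerated-value length similarity index as the similarity between the two fields.
  • After the similarities are calculated, the field with the highest similarity among the other fields is selected and replaces the target field in the first question sentence, giving a recommended second question sentence.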
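The text describes combining the four similarity indices into one similarity by an average or weighted average. A minimal sketch of that combination step follows; the function name `combine_indices`, the sample index values and the weights are all hypothetical.

```python
def combine_indices(indices, weights=None):
    """Combine similarity indices by plain or weighted average.

    With weights=None every index counts equally (plain average).
    """
    if not indices:
        raise ValueError("at least one similarity index is required")
    if weights is None:
        weights = [1.0] * len(indices)
    if len(weights) != len(indices):
        raise ValueError("weights and indices must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, indices)) / total_weight


# Hypothetical values for s1..s4.
indices = [0.2, 0.4, 0.6, 0.8]
print(combine_indices(indices))                  # plain average
print(combine_indices(indices, [2, 1, 1, 0]))    # weighted average
```

A weight of zero simply drops an index from the result, so the same function covers the case where only some of the indices are considered relevant.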
  • For example, suppose the first question sentence is "How is the average income distribution of different occupations in Shanghai?", where "occupation" is a target field, and the field in the field data table with the highest similarity to "occupation" is "industry". Replacing "occupation" with "industry" in the first question sentence gives the second question sentence, "How is the average income distribution in different industries in Shanghai?", which is then recommended to the user, completing one round of question recommendation.
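The selection and replacement step can be sketched in a few lines. This is an illustration only: the similarity scores are invented, the function name `recommend_question` is hypothetical, and the sample question is a simplified variant of the one in the text so that plain string replacement does not have to deal with English plurals.

```python
def recommend_question(question, target_field, similarities):
    """Replace the target field with the most similar other field.

    `similarities` maps each other field to its similarity with the target.
    Returns None when there is no candidate field.
    """
    if not similarities:
        return None
    best_field = max(similarities, key=similarities.get)
    return question.replace(target_field, best_field)


# Hypothetical similarity scores for the target field "occupation".
scores = {"industry": 0.81, "age": 0.42, "gender": 0.35}
first_question = "What is the average income by occupation in Shanghai?"
print(recommend_question(first_question, "occupation", scores))
```

A production system would need a more careful substitution than `str.replace`, for example replacing at the token level after segmentation, so that inflected forms and partial matches are handled correctly.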
  • After the fields of the input question sentence are extracted, each field is compared one by one with the fields in the pre-built field data table, and the extracted fields that also appear in the field data table are determined as target fields. The similarity between the target field and each other field in the field data table is then calculated, and the field with the highest similarity replaces the target field in the question sentence to give the recommended question sentence.
  • Because this embodiment takes the similarity between the preset fields into account and replaces the field in the original question sentence with the most similar field, it can generate new question sentences that better meet user expectations and improve the accuracy of the questions recommended by the intelligent question answering system.
  • A second embodiment of a question recommendation method based on field similarity calculation in an embodiment of the present application includes:
  • Steps 201-203 are the same as steps 101-103. For details, please refer to the relevant descriptions of steps 101-103.
  • the server may obtain the historical question record of the user who input the first question sentence, and search for all historical question sentences of the user.
  • A co-occurrence matrix is then constructed from the historical question sentences. The co-occurrence matrix records the number of times that any two fields in the field data table appear together in the same historical question sentence of the user.
  • a certain co-occurrence matrix M constructed based on the user's historical questioning sentence is:
  • the co-occurrence matrix M corresponds to the following Table 2:
  • The value corresponding to "gender" and "occupation" is 18, which means that across all of the user's historical question sentences, "gender" and "occupation" co-occur in the same sentence 18 times.
  • All question sentences that users have asked can be stored in advance, such as "relationship between different genders and occupations", "proportion of unmarried people in different occupations and genders", ..., "correlation between different genders and different occupations".
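Building the co-occurrence matrix from stored questions can be sketched as follows. This is a minimal illustration that detects a field by simple substring matching, which is an assumption made for brevity (the text segments sentences with a tool such as jieba); the function name `build_cooccurrence` and the three-field table are hypothetical.

```python
from itertools import combinations


def build_cooccurrence(fields, history):
    """Count how often each pair of fields appears in the same question.

    Returns a symmetric matrix (list of lists) indexed like `fields`.
    Field detection is naive substring matching, for illustration only.
    """
    index = {field: i for i, field in enumerate(fields)}
    matrix = [[0] * len(fields) for _ in fields]
    for sentence in history:
        text = sentence.lower()
        present = [field for field in fields if field in text]
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


fields = ["occupation", "gender", "age"]
history = [
    "relationship between different genders and occupations",
    "proportion of unmarried people in different occupations and genders",
    "correlation between different genders and different occupations",
]
for row in build_cooccurrence(fields, history):
    print(row)
```

Each of the three sample questions mentions both "gender" and "occupation", so their shared cell is 3; the diagonal stays at zero because a field is never paired with itself.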
  • the similarity between the target field and each other field in the field data table except the target field can be calculated according to the co-occurrence matrix.
  • step 206 may include:
  • Each field corresponds to a field vector, which is the row or column of that field extracted from the co-occurrence matrix. In the example above, the field vector of "occupation" is [0, 18, 27, 22, 3] and the field vector of "gender" is [18, 0, 2, 15, 5].
  • The cosine similarity between the field vector of the target field and the field vector of each other field is calculated separately; this gives the similarity between the target field and each other field.
  • For example, if the target field is "occupation", its similarity to the other field "gender" equals the cosine similarity of the vectors [0, 18, 27, 22, 3] and [18, 0, 2, 15, 5].
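The cosine similarity for the two field vectors quoted above can be worked out directly. The sketch below uses only the standard library; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions, not something the text specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for a field with no co-occurrences
    return dot / (norm_u * norm_v)


occupation = [0, 18, 27, 22, 3]
gender = [18, 0, 2, 15, 5]
print(round(cosine_similarity(occupation, gender), 3))  # 0.422
```

Here the dot product is 399 and the vector norms are the square roots of 1546 and 578, so the similarity between "occupation" and "gender" is about 0.422.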
  • Step 207 is the same as step 105.
  • After the fields of the input question sentence are extracted, each field is compared one by one with the fields in the pre-built field data table, and the extracted fields that also appear in the field data table are determined as target fields. All historical question sentences input by the user are then searched and a co-occurrence matrix is constructed; the similarity between the target field and each other field in the field data table is calculated according to the co-occurrence matrix; and the field with the highest similarity replaces the target field in the question sentence to obtain the recommended question sentence.
  • this embodiment proposes a specific method for calculating the similarity between the target field and each other field.
  • A third embodiment of a question recommendation method based on field similarity calculation in an embodiment of the present application includes:
  • Steps 301-305 are the same as steps 201-205. For details, please refer to the relevant descriptions of steps 201-205.
  • After the co-occurrence matrix is constructed, the field in the field data table that co-occurs most often with the target field in the user's historical question sentences is determined from the matrix; that field then replaces the target field in the first question sentence, giving the recommended third question sentence.
  • For example, suppose the first question sentence is "How is the average income distribution of different occupations in Shanghai?", where "occupation" is a target field, and in the co-occurrence matrix M the field with the most co-occurrences with "occupation" is "age" (27 times). Replacing "occupation" with "age" in the first question sentence gives the third question sentence: "How is the average income distribution of different ages in Shanghai?"
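Selecting the most frequently co-occurring field amounts to taking the largest entry in the target field's row of the co-occurrence matrix. The sketch below uses the "occupation" vector quoted in the text; the first three field names follow from the examples, while "field_4" and "field_5" are placeholders because the text does not say which fields occupy those positions. The function name is hypothetical.

```python
def most_cooccurring_field(target, fields, row):
    """Return (field, count) for the field that co-occurs most with `target`.

    `row` is the target field's row of the co-occurrence matrix,
    indexed in the same order as `fields`.
    """
    best_field, best_count = None, -1
    for field, count in zip(fields, row):
        if field != target and count > best_count:
            best_field, best_count = field, count
    return best_field, best_count


# "field_4" and "field_5" are placeholders for positions the text leaves unnamed.
fields = ["occupation", "gender", "age", "field_4", "field_5"]
occupation_row = [0, 18, 27, 22, 3]
print(most_cooccurring_field("occupation", fields, occupation_row))  # ('age', 27)
```

Unlike the cosine-similarity approach of the second embodiment, this uses only the target field's own row, so it favours fields the user tends to ask about together with the target rather than fields that behave similarly to it.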
  • After the fields of the input question sentence are extracted, each field is compared one by one with the fields in the pre-built field data table, and the extracted fields that also appear in the field data table are determined as target fields. All historical question sentences input by the user are then searched and a co-occurrence matrix is constructed; the field in the field data table that co-occurs most often with the target field in the same historical question sentence of the user is determined from the co-occurrence matrix; and that field replaces the target field in the first question sentence to obtain the recommended third question sentence.
  • this embodiment proposes a question sentence generation method that also uses the co-occurrence matrix, but is different from calculating the similarity between fields.
  • FIG. 4 shows a structural block diagram of a question recommendation device based on field similarity calculation provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
  • the device includes:
  • the question obtaining module 401 is used to obtain the input first question sentence
  • the word segmentation module 402 is configured to perform word segmentation processing on the first question sentence and extract various fields contained therein;
  • the field comparison module 403 is configured to compare each field one by one with the fields in the pre-built field data table, find the fields that also appear in the field data table, and determine them as target fields;
  • the field similarity calculation module 404 is configured to calculate the similarity between the target field and each other field in the field data table except the target field;
  • the question recommendation module 405 is configured to select the field with the highest similarity among the other fields, replace the target field in the first question sentence, and obtain a recommended second question sentence.
  • The field similarity calculation module may include:
  • a similarity index calculation unit, used to combine the string and enumeration values of the target field with the string and enumeration values of any other field to calculate a similarity index between the target field and that other field, where the similarity index is a parameter used to measure the degree of similarity between two fields;
  • a first field similarity calculation unit, configured to calculate the similarity between the target field and the other field according to the similarity index between them.
  • The similarity index calculation unit may be specifically used to calculate a string similarity index, a string length similarity index, an enumerated-value count similarity index, and an enumerated-value length similarity index between the target field and any other field.
  • The first field similarity calculation unit may be specifically used to take the average or weighted average of the string similarity index, the string length similarity index, the enumerated-value count similarity index, and the enumerated-value length similarity index as the similarity between the target field and the other field.
  • The string similarity index s1 is calculated from the following quantities (the formula itself is not reproduced in this text): sim, the number of identical strings in the two fields; short, the string length of the shorter of the two fields; long, the string length of the longer of the two fields; and a hyperparameter that controls the impact of the string on the similarity;
  • the string length similarity index s2 is calculated from short, the string length of the shorter of the two fields, and long, the string length of the longer of the two fields;
  • the enumerated-value count similarity index s3 is calculated from min, the number of enumeration values of the field that has fewer enumeration values, and max, the number of enumeration values of the field that has more enumeration values;
  • the enumerated-value length similarity index s4 is calculated from avg_min, the average enumeration-value length of the field whose enumeration values are shorter on average, and avg_max, the average enumeration-value length of the field whose enumeration values are longer on average.
  • the field similarity calculation module may include:
  • a historical sentence search unit, used to search for all historical question sentences of the user who input the first question sentence;
  • a co-occurrence matrix construction unit, configured to construct a co-occurrence matrix according to the historical question sentences, where the co-occurrence matrix records the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
  • a second field similarity calculation unit, configured to calculate the similarity between the target field and each of the other fields according to the co-occurrence matrix.
  • The second field similarity calculation unit may include:
  • a field vector extraction subunit, used to extract from the co-occurrence matrix the field vector of the target field and the field vector of each of the other fields, where each element of a field vector is the number of times the corresponding field and one of the fields in the field data table appear together in the same historical question sentence of the user;
  • a cosine similarity calculation subunit, used to calculate the cosine similarity between the field vector of the target field and the field vector of each of the other fields, to obtain the similarity between the target field and each of the other fields.
  • the field similarity calculation module may further include:
  • a most-frequent-field determination unit, configured to determine, according to the co-occurrence matrix, the field in the field data table that co-occurs most often with the target field in the same historical question sentence of the user;
  • the field replacement module is used to select the field with the most frequency and replace the target field in the first question sentence to obtain the recommended third question sentence.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of any of the question recommendation methods based on field similarity calculation shown in FIGS. 1 to 3 are implemented.
  • the computer-readable storage medium may be non-volatile or volatile.
  • An embodiment of the present application further provides a server, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor; when the processor executes the computer-readable instructions, the steps of any of the question recommendation methods based on field similarity calculation shown in FIGS. 1 to 3 are implemented.
  • An embodiment of the present application also provides a computer program product which, when run on a server, causes the server to execute the steps of any of the question recommendation methods based on field similarity calculation shown in FIGS. 1 to 3.
  • Fig. 5 is a schematic diagram of a server provided by an embodiment of the present application.
  • The server 5 of this embodiment includes a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and runnable on the processor 50.
  • When the processor 50 executes the computer-readable instructions 52, the steps in the above embodiments of the question recommendation method based on field similarity calculation, such as steps 101 to 105 shown in FIG. 1, are implemented.
  • Alternatively, when the processor 50 executes the computer-readable instructions 52, the functions of the modules/units in the foregoing device embodiments, such as the functions of the modules 401 to 405 shown in FIG. 4, are implemented.
  • The computer-readable instructions 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to carry out this application.
  • The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the server 5.
  • the server 5 may be a computing device such as a smart phone, a notebook, a palmtop computer, and a cloud server.
  • the server 5 may include, but is not limited to, a processor 50 and a memory 51.
  • FIG. 5 is only an example of the server 5 and does not limit it; the server 5 may include more or fewer components than those shown in the figure, combine certain components, or use different components. For example, the server 5 may also include input and output devices, network access devices, buses, and the like.
  • The processor 50 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • The memory 51 may be an internal storage unit of the server 5, such as a hard disk or internal memory of the server 5.
  • The memory 51 may also be an external storage device of the server 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card fitted to the server 5.
  • The memory 51 may also include both an internal storage unit of the server 5 and an external storage device.
  • the memory 51 is used to store the computer readable instructions and other programs and data required by the server.
  • the memory 51 can also be used to temporarily store data that has been output or will be output.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The computer program can be stored in a computer-readable storage medium; when executed by a processor, it implements the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • The computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, removable hard disk, floppy disk, or CD-ROM.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A question recommendation method and apparatus based on the field similarity calculation, and a server, which are suitable for the technical field of artificial intelligence. The question recommendation method comprises: obtaining an input first questioning statement (101); performing word segmentation processing on the first questioning statement to extract fields comprised therein (102); respectively comparing the fields with fields comprised in a pre-constructed field data table to find out the same field in the fields and the field data table, and determining the field as a target field (103); respectively calculating the similarity between the target field and each of other fields in the field data table other than the target field (104); and selecting a field having the highest similarity from the other fields, and replacing the target field in the first questioning statement with the field to obtain a recommended second questioning statement (105). By using the question recommendation method, a new question sentence more conforming to the expectation of a user can be generated, and the accuracy of question recommendation of an intelligent question answering system is improved.

Description

Question recommendation method, apparatus and server based on field similarity calculation
This application claims priority to Chinese patent application No. 202010255040.6, filed with the Chinese Patent Office on April 2, 2020 and entitled "Question recommendation method, apparatus and server based on field similarity calculation", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of artificial intelligence technology, and in particular relates to a question recommendation method, apparatus, storage medium and server based on field similarity calculation.
Background
An intelligent question answering system based on natural language typically works as follows: the user enters a question; the system applies natural language processing to the question to generate a structured query language statement; the system uses that statement to look up the answer in a database or knowledge base; and finally the query result is returned to the user.
At present, intelligent question answering systems recommend questions in two main ways. One is real-time recommendation, that is, recommendation based on the question the user is currently entering; the other is similar-question recommendation. Real-time recommendation is usually triggered by keywords: for example, when the user enters "by", the name of an enumerated field is recommended. Similar-question recommendation randomly replaces keywords of the same type in the original question to assemble a new question. However, the inventor has realized that the questions recommended in these two ways are often far from what the user expects, so the accuracy of question recommendation is low.
Summary of the Invention
In view of this, this application proposes a question recommendation method, apparatus, storage medium and server based on field similarity calculation, which can improve the accuracy of the questions recommended by an intelligent question answering system.
In a first aspect, an embodiment of this application provides a question recommendation method based on field similarity calculation, including:
obtaining an input first question sentence;
performing word segmentation on the first question sentence and extracting the fields it contains;
comparing the extracted fields one by one with the fields in a pre-built field data table, finding the fields that appear both among the extracted fields and in the field data table, and determining them as target fields;
calculating the similarity between the target field and each of the other fields in the field data table, excluding the target field itself;
selecting, from the other fields, the field with the highest similarity, and using it to replace the target field in the first question sentence to obtain a recommended second question sentence.
In a second aspect, an embodiment of this application provides a question recommendation apparatus based on field similarity calculation, including:
a question acquisition module, configured to obtain an input first question sentence;
a word segmentation module, configured to perform word segmentation on the first question sentence and extract the fields it contains;
a field comparison module, configured to compare the extracted fields one by one with the fields in a pre-built field data table, find the fields that appear both among the extracted fields and in the field data table, and determine them as target fields;
a field similarity calculation module, configured to calculate the similarity between the target field and each of the other fields in the field data table, excluding the target field itself;
a question recommendation module, configured to select, from the other fields, the field with the highest similarity, and use it to replace the target field in the first question sentence to obtain a recommended second question sentence.
In a third aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the question recommendation method proposed in the first aspect of the embodiments of this application.
In a fourth aspect, an embodiment of this application provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the question recommendation method proposed in the first aspect of the embodiments of this application.
In a fifth aspect, an embodiment of this application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the steps of the question recommendation method described in the first aspect.
In the question recommendation method based on field similarity calculation proposed in this application, after the fields of the input question sentence are extracted, they are compared one by one with the fields in a pre-built field data table, and the fields that appear both among the extracted fields and in the field data table are determined as target fields. The similarity between the target field and each of the other fields in the field data table is then calculated, and the field with the highest similarity is used to replace the target field in the question sentence, yielding the recommended question. Compared with the conventional approach of randomly replacing keywords of the same type in the sentence, this application takes the similarity between the preset fields into account and replaces the field in the original question sentence with the most similar field. This can generate new questions that better match the user's expectations and improves the accuracy of the questions recommended by the intelligent question answering system.
Brief Description of the Drawings
FIG. 1 is a flowchart of a first embodiment of a question recommendation method provided by an embodiment of this application;
FIG. 2 is a flowchart of a second embodiment of a question recommendation method provided by an embodiment of this application;
FIG. 3 is a flowchart of a third embodiment of a question recommendation method provided by an embodiment of this application;
FIG. 4 is a structural diagram of an embodiment of a question recommendation apparatus provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a server provided by an embodiment of this application.
Detailed Description
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so that the embodiments of this application can be thoroughly understood. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary detail does not obscure the description of this application. In addition, in the description of this specification and the appended claims, the terms "first", "second", "third" and so on are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
This application proposes a question recommendation method, apparatus, storage medium and server, which can improve the accuracy of the questions recommended by an intelligent question answering system.
It should be understood that the question recommendation method based on field similarity calculation proposed in the embodiments of this application may be executed by various types of servers or terminal devices.
Referring to FIG. 1, a first embodiment of a question recommendation method based on field similarity calculation in the embodiments of this application includes:
101. Obtain the input first question sentence;
The user can enter the question to be asked, that is, the first question sentence, on a terminal device by voice input or manual input. The question sentence is then sent to the intelligent question answering system on the server side.
102. Perform word segmentation on the first question sentence and extract the fields it contains;
After obtaining the question sentence, the server segments it into words and extracts the fields it contains. Any of the word segmentation approaches available in the prior art can be used, for example jieba. Suppose the user's question is "男性不同职业平均年龄如何?" ("What is the average age of men in different occupations?"). After jieba segmentation, the resulting field list is ["男性", "不同", "职业", "平均", "年龄", "如何", "?"] (male, different, occupation, average, age, how, ?).
103. Compare the extracted fields one by one with the fields in a pre-built field data table, find the fields that appear both among the extracted fields and in the field data table, and determine them as target fields;
After segmentation yields the fields of the first question sentence, the server compares these fields one by one with the fields in the pre-built field data table, finds the fields that appear in both, and determines them as target fields.
The pre-built field data table may be as shown in Table 1 below:
Table 1
姓名 (Name) | 职业 (Occupation) | 性别 (Gender) | 年龄 (Age) | 个人税后月收入 (Personal monthly income after tax) | 行业 (Industry)
张三 (Zhang San) | 警察 (police officer) | 男 (male) | 35 | 4500 | 安保 (security)
李四 (Li Si) | 服务员 (waiter) | 女 (female) | 29 | 4000 | 服务 (service)
In Table 1, "姓名" (name), "职业" (occupation), "性别" (gender), "年龄" (age), "个人税后月收入" (personal monthly income after tax) and "行业" (industry) are the fields of the field data table, while "张三" (Zhang San), "李四" (Li Si), "服务员" (waiter), "警察" (police officer), "男" (male), "女" (female), "安保" (security), "服务" (service) and so on are enumerated values of those fields. When the field data table is built, these fields and enumerated values are written into a data structure. In Python, for example, a dict can be used to store the data, forming a dict-based table.
In addition, these fields can be added to jieba's custom dictionary so that field keywords are not split apart when the user's question is segmented. For example, by default jieba splits the field keyword "个人税后月收入" into three tokens, "个人" (personal), "税后" (after tax) and "月收入" (monthly income). If "个人税后月收入" is added to jieba's custom dictionary, jieba no longer splits it.
Suppose the extracted fields are ["男性", "不同", "职业", "平均", "年龄", "如何", "?"]. Comparing them with the fields in Table 1 shows that the fields in common are "职业" (occupation) and "年龄" (age), which become the target fields. Note that there may be one target field or several.
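As a minimal sketch of the dict-based field data table and the comparison in step 103, the snippet below stores Table 1 as a Python dict and keeps the tokens that are also field names. The names `FIELD_TABLE` and `find_target_fields` are hypothetical, and leaving the numeric fields with empty value lists is an assumption made only for illustration.

```python
# Hypothetical dict-based field data table: field name -> enumerated values.
# Numeric fields are given empty lists here purely for illustration.
FIELD_TABLE = {
    "姓名": ["张三", "李四"],
    "职业": ["警察", "服务员"],
    "性别": ["男", "女"],
    "年龄": [],
    "个人税后月收入": [],
    "行业": ["安保", "服务"],
}


def find_target_fields(tokens, field_table):
    """Return the tokens that are also field names, in sentence order."""
    return [tok for tok in tokens if tok in field_table]


# Token list taken from the segmentation example in step 102.
tokens = ["男性", "不同", "职业", "平均", "年龄", "如何", "?"]
targets = find_target_fields(tokens, FIELD_TABLE)
print(targets)
```

A dict keeps the membership test for each token cheap, and the same structure gives access to the enumerated values needed later for the similarity indicators.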
104. Calculate the similarity between the target field and each of the other fields in the field data table, excluding the target field itself;
After the target field is determined, the similarity between the target field and each of the other fields in the field data table is calculated. In the Table 1 example, for the target field "职业" (occupation), this means calculating the similarity of "职业" with "姓名" (name), with "性别" (gender), with "年龄" (age), with "个人税后月收入" (personal monthly income after tax) and with "行业" (industry).
Further, the similarity between the target field and any one of the other fields in the field data table can be calculated through the following steps:
(1) Using the character string and enumerated values of the target field together with the character string and enumerated values of the other field, calculate the similarity indicators of the two fields, where a similarity indicator is a parameter that measures the degree of similarity between two fields;
(2) From the similarity indicators of the target field and the other field, calculate the similarity between the two fields.
Attributes of the character string and the enumerated values, such as the length of the string or the number and categories of the enumerated values, are important parameters for determining how similar two fields are. Further, calculating the similarity indicators of the target field and the other field may include calculating a string similarity indicator, a string length similarity indicator, an enumerated value count similarity indicator and an enumerated value length similarity indicator for the two fields.
Specifically, the string similarity indicator can be calculated using the following formula:
[Formula image: Figure PCTCN2021078031-appb-000001]
where s1 denotes the string similarity indicator; sim denotes the number of identical characters shared by the two fields (that is, the target field and the other field); short denotes the string length of the shorter of the two fields; long denotes the string length of the longer of the two fields; and α is a hyperparameter that controls how strongly the string affects the similarity. The term shown in Figure PCTCN2021078031-appb-000002 serves to compress s1 into the range between 0 and 1. For example, for the two fields "个人税后月收入" (personal monthly income after tax) and "个人所得税" (personal income tax), computing s1 uses sim = 3 (the characters "个", "人" and "税"), short = 5 and long = 7.
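The inputs to s1 in the example above can be reproduced with a short sketch. The formula for s1 itself is only available as an image, so it is not implemented here; the function name `string_overlap_stats` is hypothetical, and counting shared characters as a set intersection of distinct characters is an assumption.

```python
def string_overlap_stats(field_a, field_b):
    """Return (sim, short, long) for two field names.

    sim is counted as the number of distinct characters that occur in
    both names (an assumption; the source only says "identical characters").
    """
    sim = len(set(field_a) & set(field_b))
    short, long_ = sorted((len(field_a), len(field_b)))
    return sim, short, long_


stats = string_overlap_stats("个人税后月收入", "个人所得税")
print(stats)
```

If repeated characters should count more than once, a multiset intersection with `collections.Counter` would be the natural alternative.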
The string length similarity indicator can be calculated using the following formula:
[Formula image: Figure PCTCN2021078031-appb-000003]
where s2 denotes the string length similarity indicator; short denotes the string length of the shorter of the two fields (that is, the target field and the other field); and long denotes the string length of the longer of the two fields. For example, computing s2 for the fields "个人税后月收入" (personal monthly income after tax) and "职业" (occupation) gives the value shown in Figure PCTCN2021078031-appb-000004.
The enumerated value count similarity indicator can be calculated using the following formula:
[Formula image: Figure PCTCN2021078031-appb-000005]
where s3 denotes the enumerated value count similarity indicator; min denotes the number of enumerated values of the field that has fewer enumerated values; and max denotes the number of enumerated values of the field that has more enumerated values. For example, if the "职业" (occupation) field in the field data table has 6 enumerated values (警察, 护士, 教师, 程序员, 学生, 职员; that is, police officer, nurse, teacher, programmer, student, clerk) and the "性别" (gender) field has 2 enumerated values (男 and 女; male and female), then s3 for the two fields is the value shown in Figure PCTCN2021078031-appb-000006.
The enumerated value length similarity indicator can be calculated using the following formula:
[Formula image: Figure PCTCN2021078031-appb-000007]
where s4 denotes the enumerated value length similarity indicator; avg_min denotes the average enumerated value length of the field whose enumerated values are shorter on average; and avg_max denotes the average enumerated value length of the field whose enumerated values are longer on average. For example, the average enumerated value length of the "职业" (occupation) field is (2+2+2+3+2+2)/6 = 2.17 and that of the "性别" (gender) field is (1+1)/2 = 1, so s4 for the two fields is the value shown in Figure PCTCN2021078031-appb-000008.
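The formulas for s2, s3 and s4 survive only as images. Since each is defined purely in terms of a smaller and a larger quantity, one plausible reading is a simple smaller-over-larger ratio. The sketch below implements that reading as an explicit assumption, not as the formulas of the application; all function names are hypothetical.

```python
def ratio(a, b):
    """Smaller value divided by larger value; 0.0 when the larger is zero."""
    lo, hi = min(a, b), max(a, b)
    return lo / hi if hi else 0.0


def length_similarity(field_a, field_b):
    """Assumed form of s2: short / long over the field-name lengths."""
    return ratio(len(field_a), len(field_b))


def enum_count_similarity(enums_a, enums_b):
    """Assumed form of s3: min / max over the enumerated-value counts."""
    return ratio(len(enums_a), len(enums_b))


def mean_len(values):
    """Average string length of a list of enumerated values."""
    return sum(len(v) for v in values) / len(values) if values else 0.0


def enum_length_similarity(enums_a, enums_b):
    """Assumed form of s4: avg_min / avg_max over average value lengths."""
    return ratio(mean_len(enums_a), mean_len(enums_b))


occupation = ["警察", "护士", "教师", "程序员", "学生", "职员"]
gender = ["男", "女"]
s2 = length_similarity("个人税后月收入", "职业")
s3 = enum_count_similarity(occupation, gender)
s4 = enum_length_similarity(occupation, gender)
```

Under this assumption every indicator lies between 0 and 1 and equals 1 only when the two compared quantities are equal, which keeps the indicators on a common scale before they are averaged.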
Specifically, calculating the similarity between the target field and the other field from their similarity indicators may include:
calculating the average or weighted average of the string similarity indicator, the string length similarity indicator, the enumerated value count similarity indicator and the enumerated value length similarity indicator, and taking it as the similarity between the target field and the other field; for example, the similarity of two fields may be computed as shown in Figure PCTCN2021078031-appb-000009.
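The averaging step can be sketched as follows. The function name `combine` and the sample indicator values and weights are hypothetical; the source specifies only that an average or weighted average of the four indicators is used.

```python
def combine(indicators, weights=None):
    """Weighted average of similarity indicators; plain mean by default."""
    if weights is None:
        weights = [1.0] * len(indicators)
    if len(weights) != len(indicators):
        raise ValueError("one weight is required per indicator")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, indicators)) / total


# Hypothetical values for (s1, s2, s3, s4), chosen only for illustration.
indicators = [0.5, 0.25, 0.75, 0.5]
plain = combine(indicators)
weighted = combine(indicators, weights=[3, 1, 0, 0])
```

Dividing by the sum of the weights keeps the result in the same range as the indicators regardless of how the weights are scaled.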
105. Select, from the other fields, the field with the highest similarity, and use it to replace the target field in the first question sentence to obtain a recommended second question sentence.
After the similarity between the target field and each of the other fields in the field data table has been calculated, the field with the highest similarity is selected from the other fields and used to replace the target field in the first question sentence, yielding the recommended second question sentence. For example, suppose the first question sentence is "上海不同职业的平均收入分布如何" ("How is average income distributed across different occupations in Shanghai?"), in which "职业" (occupation) is a target field, and the field in the field data table most similar to "职业" is "行业" (industry). Then "行业" can replace "职业" in the first question sentence, giving the second question sentence "上海不同行业的平均收入分布如何" ("How is average income distributed across different industries in Shanghai?"). Finally, the second question sentence is recommended to the user, completing one round of question recommendation.
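Step 105 amounts to an arg-max followed by a string substitution. The sketch below assumes the similarities have already been computed; the numeric values in `sims` are hypothetical and chosen only so that "行业" ranks highest, as in the example above. The function name `recommend` is also hypothetical.

```python
def recommend(question, target, similarities):
    """Replace `target` in `question` with the most similar other field."""
    best = max(similarities, key=similarities.get)
    # str.replace substitutes every occurrence of the target field.
    return question.replace(target, best)


# Hypothetical similarities between "职业" and the other fields.
sims = {
    "姓名": 0.21,
    "性别": 0.35,
    "年龄": 0.40,
    "个人税后月收入": 0.18,
    "行业": 0.62,
}
second = recommend("上海不同职业的平均收入分布如何", "职业", sims)
print(second)
```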
In this embodiment, after the fields of the input question sentence are extracted, they are compared one by one with the fields in the pre-built field data table, and the fields that appear both among the extracted fields and in the field data table are determined as target fields. The similarity between the target field and each of the other fields in the field data table is then calculated, and the field with the highest similarity is used to replace the target field in the question sentence, yielding the recommended question. Compared with the conventional approach of randomly replacing keywords of the same type in the sentence, this embodiment takes the similarity between the preset fields into account and replaces the field in the original question sentence with the most similar field. This can generate new questions that better match the user's expectations and improves the accuracy of the questions recommended by the intelligent question answering system.
Referring to FIG. 2, a second embodiment of a question recommendation method based on field similarity calculation in the embodiments of this application includes:
201. Obtain the input first question sentence;
202. Perform word segmentation on the first question sentence and extract the fields it contains;
203. Compare the extracted fields one by one with the fields in a pre-built field data table, find the fields that appear both among the extracted fields and in the field data table, and determine them as target fields;
Steps 201-203 are the same as steps 101-103; see the description of steps 101-103 for details.
204. Look up all historical question sentences of the user who entered the first question sentence;
After the target field is determined, the server can obtain the question history of the user who entered the first question sentence and look up all of that user's historical question sentences.
205. Build a co-occurrence matrix from the historical question sentences, where the co-occurrence matrix records the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
A co-occurrence matrix is then built from the historical question sentences. It records the number of times any two fields in the field data table appear together in the same historical question sentence of the user. For example, a co-occurrence matrix M built from a user's historical question sentences is:
[Matrix image: Figure PCTCN2021078031-appb-000010]
The co-occurrence matrix M corresponds to Table 2 below:
Table 2
[Table image: Figure PCTCN2021078031-appb-000011]
In Table 2, the value for "性别" (gender) and "职业" (occupation) is 18, meaning that across all of the user's historical question sentences, "性别" and "职业" appear together in the same sentence 18 times. For example, suppose all the question sentences the user has asked are stored in advance, such as "不同性别和职业之间的关系" ("the relationship between different genders and occupations"), "不同职业和性别未婚比例" ("the proportion of unmarried people by occupation and gender"), ..., "不同性别和不同职业之间的相关性" ("the correlation between different genders and different occupations"). Each of these sentences contains both "职业" and "性别". If there are 18 such sentences, then "职业" and "性别" have co-occurred 18 times.
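The construction of the co-occurrence matrix in step 205 can be sketched as below. For brevity, field presence is detected by substring search rather than by the word segmentation the method describes, which is a simplifying assumption. The third history sentence and the function name `build_cooccurrence` are hypothetical.

```python
from itertools import combinations


def build_cooccurrence(fields, history):
    """Count how often each pair of fields appears in the same sentence."""
    index = {f: i for i, f in enumerate(fields)}
    n = len(fields)
    matrix = [[0] * n for _ in range(n)]
    for sentence in history:
        # Simplification: substring test instead of segmentation.
        present = [f for f in fields if f in sentence]
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


fields = ["职业", "性别", "年龄"]
history = [
    "不同性别和职业之间的关系",
    "不同职业和性别未婚比例",
    "不同年龄的职业分布",  # hypothetical extra sentence
]
M = build_cooccurrence(fields, history)
```

The matrix is symmetric with a zero diagonal, which matches the field vectors quoted in the text, where each field's own position holds 0.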
206. Calculate, from the co-occurrence matrix, the similarity between the target field and each of the other fields in the field data table, excluding the target field itself;
Once the co-occurrence matrix has been built, it can be used to calculate the similarity between the target field and each of the other fields in the field data table.
Specifically, step 206 may include:
(1) Extract from the co-occurrence matrix the field vector of the target field and the field vector of each of the other fields, where the elements of a field vector are the numbers of times the corresponding field appears together with each field of the field data table in the same historical question sentence of the user;
(2) Calculate the cosine similarity between the field vector of the target field and the field vector of each of the other fields, to obtain the similarity between the target field and each of the other fields.
In the co-occurrence matrix, each field corresponds to one field vector. For example, the field vector of "职业" (occupation) is [0,18,27,22,3] and the field vector of "性别" (gender) is [18,0,2,15,5]. In other words, the row or column of the co-occurrence matrix that belongs to a field is that field's field vector. After the field vectors are extracted, the cosine similarity between the field vector of the target field and the field vector of each of the other fields is calculated, which gives the similarity between the target field and each of the other fields. For example, if the target field is "职业", its similarity to the other field "性别" equals the cosine similarity of the vectors [0,18,27,22,3] and [18,0,2,15,5].
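The cosine similarity in step 206 can be checked directly on the two field vectors quoted above. This is a minimal pure-Python sketch; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for a field with no co-occurrences
    return dot / (norm_u * norm_v)


occupation = [0, 18, 27, 22, 3]  # field vector of "职业"
gender = [18, 0, 2, 15, 5]       # field vector of "性别"
similarity = cosine_similarity(occupation, gender)
print(round(similarity, 4))
```

For these vectors the dot product is 399 and the squared norms are 1546 and 578, so the similarity is 399 / sqrt(1546 × 578), roughly 0.42.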
207. Select, from the other fields, the field with the highest similarity, and use it to replace the target field in the first question sentence to obtain a recommended second question sentence.
Step 207 is the same as step 105; see the description of step 105 for details.
In this embodiment, after the fields of the input question sentence are extracted, they are compared one by one with the fields in the pre-built field data table, and the fields that appear both among the extracted fields and in the field data table are determined as target fields. All historical question sentences entered by the user are then looked up and a co-occurrence matrix is built. The co-occurrence matrix is used to calculate the similarity between the target field and each of the other fields in the field data table, and the field with the highest similarity is used to replace the target field in the question sentence, yielding the recommended question. Compared with the first embodiment of this application, this embodiment provides a specific way of calculating the similarity between the target field and each of the other fields.
Referring to FIG. 3, a third embodiment of the question recommendation method based on field similarity calculation in the embodiments of the present application includes:
301. Obtain the input first question sentence;
302. Perform word segmentation on the first question sentence and extract the fields it contains;
303. Compare each of the fields one by one with the fields in a pre-built field data table, find the fields that appear both among the extracted fields and in the field data table, and determine them to be target fields;
304. Retrieve all historical question sentences of the user who input the first question sentence;
305. Construct a co-occurrence matrix from the historical question sentences, where the co-occurrence matrix records the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
Steps 301-305 are the same as steps 201-205; see the description of steps 201-205 for details.
306. Determine, according to the co-occurrence matrix, the field in the field data table that appears together with the target field in the same historical question sentence of the user the greatest number of times;
307. Select that most frequently co-occurring field and use it to replace the target field in the first question sentence, obtaining a recommended third question sentence.
After the co-occurrence matrix is constructed, the field in the field data table that appears together with the target field in the same historical question sentence of the user the greatest number of times can be determined from the matrix. That field is then selected to replace the target field in the first question sentence, obtaining the recommended third question sentence.
For example, suppose the first question sentence is "What is the average income distribution by occupation in Shanghai?", in which "occupation" is a target field. If, in the co-occurrence matrix M, the field that co-occurs most often with "occupation" is "age" (27 times), then "age" can replace "occupation" in the first question sentence, giving the third question sentence: "What is the average income distribution by age in Shanghai?"
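The replacement in steps 306-307 can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the function names, the dictionary layout of the matrix, and every count other than the 27 co-occurrences of "occupation" and "age" are assumptions made for the example.

```python
def most_cooccurring_field(cooc, target):
    """Return the field that co-occurs most often with `target`, or None.

    `cooc` is assumed to be a nested dict: cooc[a][b] = number of historical
    question sentences that contain both field a and field b.
    """
    row = cooc.get(target, {})
    candidates = {f: n for f, n in row.items() if f != target and n > 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def recommend_by_cooccurrence(question, target, cooc):
    """Replace `target` in `question` with its most frequent co-occurring field."""
    best = most_cooccurring_field(cooc, target)
    if best is None:
        return None
    return question.replace(target, best)


# Only the count 27 comes from the text; the other counts are hypothetical.
cooc = {"occupation": {"age": 27, "city": 12, "gender": 5}}
question = "What is the average income distribution by occupation in Shanghai?"
third_question = recommend_by_cooccurrence(question, "occupation", cooc)
```

Returning None when the target field has never co-occurred with anything is one possible design choice; the application does not say what should happen in that case.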
In this embodiment of the present application, after the fields of the input question sentence are extracted, each field is compared one by one with the fields in the pre-built field data table, and any extracted field that also appears in the field data table is determined to be a target field. All historical question sentences input by the user are then retrieved and a co-occurrence matrix is constructed. The field in the field data table that appears together with the target field in the same historical question sentence of the user the greatest number of times is determined from the co-occurrence matrix, and that field replaces the target field in the first question sentence, yielding the recommended third question sentence. Compared with the second embodiment of the present application, this embodiment provides a way of generating the question sentence that uses the same co-occurrence matrix but does not calculate a similarity between fields.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the numbering should not limit the implementation of the embodiments of the present application in any way.
Corresponding to the question recommendation method based on field similarity calculation described in the foregoing embodiments, FIG. 4 shows a structural block diagram of a question recommendation apparatus based on field similarity calculation provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 4, the apparatus includes:
a question obtaining module 401, configured to obtain the input first question sentence;
a word segmentation module 402, configured to perform word segmentation on the first question sentence and extract the fields it contains;
a field comparison module 403, configured to compare each of the fields one by one with the fields in a pre-built field data table, find the fields that appear both among the extracted fields and in the field data table, and determine them to be target fields;
a field similarity calculation module 404, configured to calculate the similarity between the target field and each other field in the field data table other than the target field;
a question recommendation module 405, configured to select, among the other fields, the field with the highest similarity and use it to replace the target field in the first question sentence, obtaining a recommended second question sentence.
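The flow through modules 401 to 405 can be sketched end to end as below. This is an illustration under stated assumptions, not the apparatus itself: whitespace splitting stands in for a real word segmenter, the similarity function is passed in by the caller, and the names `recommend_question`, `scores` and `toy_similarity` are hypothetical.

```python
def recommend_question(question, table_fields, similarity):
    """Sketch of modules 401-405 for one input question sentence.

    question     -- the first question sentence (module 401)
    table_fields -- field names in the pre-built field data table
    similarity   -- callable (field_a, field_b) -> float (module 404)
    """
    # Module 402: word segmentation. Whitespace splitting is an assumption
    # that only works for space-delimited text.
    tokens = question.split()

    # Module 403: extracted fields that also appear in the field data table.
    targets = [t for t in tokens if t in table_fields]

    # Modules 404 and 405: pick the most similar other field and substitute it.
    recommendations = []
    for target in targets:
        others = [f for f in table_fields if f != target]
        if not others:
            continue
        best = max(others, key=lambda f: similarity(target, f))
        recommendations.append(
            " ".join(best if t == target else t for t in tokens)
        )
    return recommendations


# Hypothetical similarity scores, for illustration only.
scores = {("occupation", "age"): 0.8, ("occupation", "gender"): 0.3}


def toy_similarity(a, b):
    return scores.get((a, b), scores.get((b, a), 0.0))


fields = ["occupation", "age", "gender"]
result = recommend_question(
    "average income by occupation in Shanghai", fields, toy_similarity
)
```

Passing the similarity function as a parameter keeps the sketch neutral between the string-and-enumeration-based similarity and the co-occurrence-based similarity described in the different embodiments.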
Further, the field similarity calculation module may include:
a similarity index calculation unit, configured to calculate a similarity index between the target field and any one of the other fields by combining the string and enumeration values of the target field with the string and enumeration values of that other field, where the similarity index is a parameter that measures the degree of similarity between two fields;
a first field similarity calculation unit, configured to calculate the similarity between the target field and that other field according to the similarity index between the two.
Further, the similarity index calculation unit may be specifically configured to calculate, for the target field and any one of the other fields, a string similarity index, a string length similarity index, an enumeration value count similarity index, and an enumeration value length similarity index;
the first field similarity calculation unit may be specifically configured to calculate the average or the weighted average of the string similarity index, the string length similarity index, the enumeration value count similarity index, and the enumeration value length similarity index, and to use the result as the similarity between the target field and that other field.
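Combining the four indices into a single similarity can be sketched as follows. The function name `combine_indices`, the normalization by the sum of the weights, and the example values are assumptions; the text states only that an average or a weighted average is taken.

```python
def combine_indices(indices, weights=None):
    """Average (or weighted average) of the similarity indices s_1..s_4.

    With weights=None every index counts equally, which is the plain average.
    Weights are normalized by their sum, an assumption made here so that the
    result stays in the same range as the individual indices.
    """
    if weights is None:
        weights = [1.0] * len(indices)
    if len(weights) != len(indices):
        raise ValueError("need exactly one weight per index")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, indices)) / total


# Hypothetical index values for one pair of fields.
plain = combine_indices([0.5, 1.0, 0.5, 1.0])
weighted = combine_indices([1.0, 0.0, 0.0, 0.0], weights=[2, 1, 1, 0])
```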
Further, the string similarity index may be calculated using the following formula:
Figure PCTCN2021078031-appb-000012
where s_1 denotes the string similarity index, sim denotes the number of identical character strings shared by the two fields, short denotes the string length of the shorter of the two fields, long denotes the string length of the longer of the two fields, and α is a hyperparameter that controls the influence of the string on the similarity;
the string length similarity index may be calculated using the following formula:
Figure PCTCN2021078031-appb-000013
where s_2 denotes the string length similarity index, short denotes the string length of the shorter of the two fields, and long denotes the string length of the longer of the two fields;
the enumeration value count similarity index may be calculated using the following formula:
Figure PCTCN2021078031-appb-000014
where s_3 denotes the enumeration value count similarity index, min denotes the number of enumeration values of the field that has fewer enumeration values, and max denotes the number of enumeration values of the field that has more enumeration values;
the enumeration value length similarity index may be calculated using the following formula:
Figure PCTCN2021078031-appb-000015
where s_4 denotes the enumeration value length similarity index, avg_min denotes the average enumeration value length of the field whose enumeration values are shorter on average, and avg_max denotes the average enumeration value length of the field whose enumeration values are longer on average.
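The formulas themselves are published as images and are not reproduced in this text. Purely as an illustration, and not as the formulas of the application, the sketch below assumes that s_2, s_3 and s_4 are each the ratio of the smaller quantity to the larger one, which is one reading consistent with the variable definitions above. The index s_1 is left out because its dependence on α cannot be inferred from the definitions. All function names are hypothetical.

```python
def ratio(a, b):
    """Smaller value divided by larger value; 1.0 when both are zero.

    ASSUMPTION: the ratio form is a guess at the shape of the formulas,
    which appear only as images in the published document.
    """
    lo, hi = sorted((a, b))
    return lo / hi if hi > 0 else 1.0


def mean_length(values):
    """Average string length of a list of enumeration values (0.0 if empty)."""
    return sum(len(v) for v in values) / len(values) if values else 0.0


def length_index(name_a, name_b):
    """Assumed s_2: short / long over the two field-name string lengths."""
    return ratio(len(name_a), len(name_b))


def enum_count_index(enums_a, enums_b):
    """Assumed s_3: min / max over the two enumeration value counts."""
    return ratio(len(enums_a), len(enums_b))


def enum_length_index(enums_a, enums_b):
    """Assumed s_4: avg_min / avg_max over average enumeration value lengths."""
    return ratio(mean_length(enums_a), mean_length(enums_b))
```

Under this assumption each index lies in the interval from 0 to 1 and equals 1 when the two fields agree exactly on the measured quantity.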
Further, the field similarity calculation module may include:
a historical sentence retrieval unit, configured to retrieve all historical question sentences of the user who input the first question sentence;
a co-occurrence matrix construction unit, configured to construct a co-occurrence matrix from the historical question sentences, where the co-occurrence matrix records the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
a second field similarity calculation unit, configured to calculate the similarity between the target field and each of the other fields according to the co-occurrence matrix.
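Constructing the co-occurrence matrix can be sketched as below. The sketch assumes that each historical question sentence has already been segmented into a list of fields, that a field counts once per sentence, and that the diagonal is left at zero; none of these details is fixed by the text, and the function name and sample history are hypothetical.

```python
from itertools import combinations


def build_cooccurrence(history, table_fields):
    """Count how often each pair of table fields shares a historical question.

    history      -- list of questions, each given as a list of extracted fields
                    (segmentation is assumed to have been done already)
    table_fields -- field names in the field data table
    Returns a nested dict with cooc[a][b] == cooc[b][a].
    """
    table = set(table_fields)
    cooc = {a: {b: 0 for b in table_fields} for a in table_fields}
    for fields_in_question in history:
        # Deduplicate so a field repeated within one sentence counts once.
        present = sorted(set(fields_in_question) & table)
        for a, b in combinations(present, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


# Hypothetical history of one user, for illustration only.
fields = ["age", "occupation", "income"]
history = [
    ["occupation", "income"],
    ["age", "income"],
    ["age", "occupation", "income", "unknown"],
]
cooc = build_cooccurrence(history, fields)
```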
Further, the second field similarity calculation unit may include:
a field vector extraction subunit, configured to extract from the co-occurrence matrix the field vector of the target field and the field vector of each of the other fields, where the elements of a field vector are the numbers of times the corresponding field appears together with each field in the field data table in the same historical question sentence of the user;
a cosine similarity calculation subunit, configured to calculate the cosine similarity between the field vector of the target field and the field vector of each of the other fields, obtaining the similarity between the target field and each of the other fields.
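The two subunits can be sketched as follows: each row of the co-occurrence matrix is treated as a field vector, and the target row is compared with every other row by cosine similarity. The matrix values, the function names, and the choice to return 0.0 for an all-zero vector are assumptions made for the illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_field(matrix, fields, target):
    """Return (field, score) for the field whose row is closest to the target row."""
    t = fields.index(target)
    best, best_score = None, -1.0
    for i, name in enumerate(fields):
        if i == t:
            continue
        score = cosine(matrix[t], matrix[i])
        if score > best_score:
            best, best_score = name, score
    return best, best_score


# Hypothetical symmetric co-occurrence matrix; row i is the field vector of fields[i].
fields = ["occupation", "age", "income", "city"]
matrix = [
    [0, 3, 5, 1],
    [3, 0, 4, 1],
    [5, 4, 0, 2],
    [1, 1, 2, 0],
]
best, score = most_similar_field(matrix, fields, "occupation")
```

Note that cosine similarity over rows rewards fields used in similar contexts, so the field it selects can differ from the field with the highest raw co-occurrence count.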
Further, the field similarity calculation module may also include:
a most-frequent field determination unit, configured to determine, according to the co-occurrence matrix, the field in the field data table that appears together with the target field in the same historical question sentence of the user the greatest number of times;
a field replacement module, configured to select that most frequently co-occurring field and use it to replace the target field in the first question sentence, obtaining a recommended third question sentence.
An embodiment of the present application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of any of the question recommendation methods based on field similarity calculation shown in FIG. 1 to FIG. 3. The computer-readable storage medium may be non-volatile or volatile.
An embodiment of the present application further provides a server, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the steps of any of the question recommendation methods based on field similarity calculation shown in FIG. 1 to FIG. 3.
An embodiment of the present application further provides a computer program product which, when run on a server, causes the server to perform the steps of any of the question recommendation methods based on field similarity calculation shown in FIG. 1 to FIG. 3.
FIG. 5 is a schematic diagram of a server provided by an embodiment of the present application. As shown in FIG. 5, the server 5 of this embodiment includes a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and executable on the processor 50. When executing the computer-readable instructions 52, the processor 50 implements the steps in the foregoing embodiments of the question recommendation method based on field similarity calculation, for example steps 101 to 105 shown in FIG. 1. Alternatively, when executing the computer-readable instructions 52, the processor 50 implements the functions of the modules/units in the foregoing apparatus embodiments, for example the functions of modules 401 to 405 shown in FIG. 4.
By way of example, the computer-readable instructions 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer-readable instructions 52 in the server 5.
The server 5 may be a computing device such as a smartphone, a notebook computer, a palmtop computer, or a cloud server. The server 5 may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that FIG. 5 is merely an example of the server 5 and does not limit it; the server 5 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the server 5 may also include input/output devices, network access devices, buses, and the like.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 51 may be an internal storage unit of the server 5, such as a hard disk or internal memory of the server 5. The memory 51 may also be an external storage device of the server 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the server 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the server 5. The memory 51 is used to store the computer-readable instructions and the other programs and data required by the server. The memory 51 may also be used to temporarily store data that has been output or is about to be output.
It should be noted that the information exchange and execution processes between the above apparatuses/units are based on the same concept as the method embodiments of the present application. For their specific functions and technical effects, refer to the method embodiments; they are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is used only as an example. In practical applications, the functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules serve only to distinguish them from one another and do not limit the scope of protection of the present application. For the specific working processes of the units and modules in the above system, refer to the corresponding processes in the foregoing method embodiments; they are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the methods in the foregoing embodiments may be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, refer to the related descriptions of the other embodiments.
The foregoing embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in those embodiments may still be modified, and some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and they shall all fall within the scope of protection of the present application.

Claims (20)

  1. A question recommendation method based on field similarity calculation, comprising:
    obtaining an input first question sentence;
    performing word segmentation on the first question sentence and extracting the fields it contains;
    comparing each of the fields one by one with the fields in a pre-built field data table, finding the fields that appear both among the extracted fields and in the field data table, and determining them to be target fields;
    calculating the similarity between the target field and each other field in the field data table other than the target field;
    selecting, among the other fields, the field with the highest similarity and using it to replace the target field in the first question sentence, obtaining a recommended second question sentence.
  2. The question recommendation method according to claim 1, wherein the similarity between the target field and any one of the other fields in the field data table is calculated by the following steps:
    calculating a similarity index between the target field and that other field by combining the string and enumeration values of the target field with the string and enumeration values of that other field, the similarity index being a parameter that measures the degree of similarity between two fields;
    calculating the similarity between the target field and that other field according to the similarity index between the two.
  3. The question recommendation method according to claim 2, wherein calculating the similarity index between the target field and that other field comprises:
    calculating, for the target field and that other field, a string similarity index, a string length similarity index, an enumeration value count similarity index, and an enumeration value length similarity index;
    and wherein calculating the similarity between the target field and that other field according to the similarity index between the two comprises:
    calculating the average or the weighted average of the string similarity index, the string length similarity index, the enumeration value count similarity index, and the enumeration value length similarity index, and using the result as the similarity between the target field and that other field.
  4. The question recommendation method according to claim 3, wherein the string similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100001
    where s_1 denotes the string similarity index, sim denotes the number of identical character strings shared by the two fields, short denotes the string length of the shorter of the two fields, long denotes the string length of the longer of the two fields, and α is a hyperparameter that controls the influence of the string on the similarity;
    the string length similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100002
    where s_2 denotes the string length similarity index, short denotes the string length of the shorter of the two fields, and long denotes the string length of the longer of the two fields;
    the enumeration value count similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100003
    where s_3 denotes the enumeration value count similarity index, min denotes the number of enumeration values of the field that has fewer enumeration values, and max denotes the number of enumeration values of the field that has more enumeration values;
    the enumeration value length similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100004
    where s_4 denotes the enumeration value length similarity index, avg_min denotes the average enumeration value length of the field whose enumeration values are shorter on average, and avg_max denotes the average enumeration value length of the field whose enumeration values are longer on average.
  5. The question recommendation method according to claim 1, wherein calculating the similarity between the target field and each other field in the field data table other than the target field comprises:
    retrieving all historical question sentences of the user who input the first question sentence;
    constructing a co-occurrence matrix from the historical question sentences, the co-occurrence matrix recording the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
    calculating the similarity between the target field and each of the other fields according to the co-occurrence matrix.
  6. The question recommendation method according to claim 5, wherein calculating the similarity between the target field and each of the other fields according to the co-occurrence matrix comprises:
    extracting from the co-occurrence matrix the field vector of the target field and the field vector of each of the other fields, the elements of a field vector being the numbers of times the corresponding field appears together with each field in the field data table in the same historical question sentence of the user;
    calculating the cosine similarity between the field vector of the target field and the field vector of each of the other fields, obtaining the similarity between the target field and each of the other fields.
  7. The question recommendation method according to claim 5 or 6, further comprising, after constructing the co-occurrence matrix from the historical question sentences:
    determining, according to the co-occurrence matrix, the field in the field data table that appears together with the target field in the same historical question sentence of the user the greatest number of times;
    selecting that most frequently co-occurring field and using it to replace the target field in the first question sentence, obtaining a recommended third question sentence.
  8. A question recommendation apparatus based on field similarity calculation, comprising:
    a question obtaining module, configured to obtain an input first question sentence;
    a word segmentation module, configured to perform word segmentation on the first question sentence and extract the fields it contains;
    a field comparison module, configured to compare each of the fields one by one with the fields in a pre-built field data table, find the fields that appear both among the extracted fields and in the field data table, and determine them to be target fields;
    a field similarity calculation module, configured to calculate the similarity between the target field and each other field in the field data table other than the target field;
    a question recommendation module, configured to select, among the other fields, the field with the highest similarity and use it to replace the target field in the first question sentence, obtaining a recommended second question sentence.
  9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    acquiring an input first question sentence;
    performing word segmentation on the first question sentence and extracting the fields contained therein;
    comparing the fields one by one with the fields in a pre-built field data table, finding the field that the extracted fields and the field data table have in common, and determining it as a target field;
    calculating the similarity between the target field and each other field in the field data table other than the target field;
    selecting, from the other fields, the field with the highest similarity and replacing the target field in the first question sentence with it, to obtain a recommended second question sentence.
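
The steps recited in claim 9 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function name `recommend_question`, the whitespace split used as a stand-in for word segmentation, and the caller-supplied `similarity` function are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def recommend_question(
    question: str,
    field_table: Iterable[str],
    similarity: Callable[[str, str], float],
) -> Optional[str]:
    """Return a recommended question, or None if no target field is found."""
    table = list(field_table)
    # Assumption: whitespace splitting stands in for real word segmentation.
    tokens = question.split()
    # The target field is the first token that also appears in the field table.
    target = next((t for t in tokens if t in table), None)
    if target is None:
        return None
    # Candidate replacements are all table fields other than the target.
    others = [f for f in table if f != target]
    if not others:
        return None
    # Pick the other field with the highest similarity to the target.
    best = max(others, key=lambda f: similarity(target, f))
    # Replace the target field in the original question.
    return " ".join(best if t == target else t for t in tokens)


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical hand-written scores, used only to exercise the sketch."""
    scores = {("revenue", "profit"): 0.8, ("revenue", "region"): 0.3}
    return scores.get((a, b), 0.0)


print(recommend_question(
    "show revenue by month", ["revenue", "profit", "region"], toy_similarity
))  # prints "show profit by month"
```

Passing the similarity function in as a parameter keeps the sketch independent of which similarity calculation from the dependent claims is used.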
  10. The computer-readable storage medium of claim 9, wherein the similarity between the target field and any one other field in the field data table is calculated by the following steps:
    calculating similarity indices of the target field and the other field from the string and enumeration values of the target field and the string and enumeration values of the other field, a similarity index being a parameter that measures the degree of similarity between two fields;
    calculating the similarity between the target field and the other field from the similarity indices of the target field and the other field.
  11. The computer-readable storage medium of claim 10, wherein calculating the similarity indices of the target field and the other field comprises:
    calculating a string similarity index, a string length similarity index, an enumeration value count similarity index, and an enumeration value length similarity index of the target field and the other field;
    and wherein calculating the similarity between the target field and the other field from the similarity indices comprises:
    calculating the average or weighted average of the string similarity index, the string length similarity index, the enumeration value count similarity index, and the enumeration value length similarity index, and taking it as the similarity between the target field and the other field.
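
The averaging step of claim 11 can be sketched as follows. The four index values are taken as already computed, since their formulas appear only as figures in this text. The function name `combine_indices` and the example weights are assumptions, not values from the application.

```python
from typing import Optional, Sequence


def combine_indices(
    indices: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine similarity indices by plain or weighted average."""
    if not indices:
        raise ValueError("at least one similarity index is required")
    if weights is None:
        # Plain average: every index contributes equally.
        return sum(indices) / len(indices)
    if len(weights) != len(indices):
        raise ValueError("weights and indices must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average, normalised by the sum of the weights.
    return sum(w * s for w, s in zip(weights, indices)) / total


# Hypothetical index values s1..s4 for one pair of fields.
s = [0.5, 0.75, 1.0, 0.25]
print(combine_indices(s))                # 0.625
print(combine_indices(s, [2, 1, 1, 0]))  # 0.6875
```

Normalising by the sum of the weights means the caller does not need to supply weights that already add up to one.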
  12. The computer-readable storage medium of claim 11, wherein the string similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100005
    where s1 denotes the string similarity index, sim denotes the number of identical strings shared by the two fields, short denotes the string length of the shorter of the two fields, long denotes the string length of the longer of the two fields, and α is a hyperparameter that controls the influence of the string on the similarity;
    the string length similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100006
    where s2 denotes the string length similarity index, short denotes the string length of the shorter of the two fields, and long denotes the string length of the longer of the two fields;
    the enumeration value count similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100007
    where s3 denotes the enumeration value count similarity index, min denotes the number of enumeration values of the field having fewer enumeration values, and max denotes the number of enumeration values of the field having more enumeration values;
    the enumeration value length similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100008
    where s4 denotes the enumeration value length similarity index, avg_min denotes the average enumeration value length of the field whose enumeration values are shorter on average, and avg_max denotes the average enumeration value length of the field whose enumeration values are longer on average.
  13. The computer-readable storage medium of claim 9, wherein calculating the similarity between the target field and each other field in the field data table other than the target field comprises:
    retrieving all historical question sentences of the user who input the first question sentence;
    constructing a co-occurrence matrix from the historical question sentences, the co-occurrence matrix recording the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
    calculating the similarity between the target field and each of the other fields from the co-occurrence matrix.
  14. The computer-readable storage medium of claim 13, wherein calculating the similarity between the target field and each of the other fields from the co-occurrence matrix comprises:
    extracting, from the co-occurrence matrix, the field vector of the target field and the field vector of each of the other fields, each element of a field vector being the number of times the corresponding field appears together with a respective field of the field data table in the same historical question sentence of the user;
    calculating the cosine similarity between the field vector of the target field and the field vector of each of the other fields, to obtain the similarity between the target field and each of the other fields.
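
The co-occurrence approach of claims 13 and 14 can be sketched in plain Python. This is an illustration under stated assumptions: the historical questions are assumed to be already segmented into lists of fields, the diagonal of the matrix is left at zero (the claims do not say how a field's co-occurrence with itself is counted), and the names `build_cooccurrence` and `cosine` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def build_cooccurrence(
    history: Sequence[Sequence[str]], fields: Sequence[str]
) -> List[List[int]]:
    """Count how often each pair of table fields shares a historical question."""
    index = {f: i for i, f in enumerate(fields)}
    n = len(fields)
    matrix = [[0] * n for _ in range(n)]
    for sentence_fields in history:
        # Deduplicate so a field repeated in one question is counted once.
        present = sorted({index[f] for f in sentence_fields if f in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


fields = ["revenue", "profit", "region"]
history = [["revenue", "region"], ["profit", "region"], ["revenue", "region"]]
m = build_cooccurrence(history, fields)
# Row i of the matrix is the field vector of fields[i].
print(m)                   # [[0, 0, 2], [0, 0, 1], [2, 1, 0]]
print(cosine(m[0], m[1]))  # 1.0: revenue and profit co-occur with the same field
print(cosine(m[0], m[2]))  # 0.0
```

In this toy history, "revenue" and "profit" never appear together but both appear with "region", so their field vectors point in the same direction and the cosine similarity is high.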
  15. A server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    acquiring an input first question sentence;
    performing word segmentation on the first question sentence and extracting the fields contained therein;
    comparing the fields one by one with the fields in a pre-built field data table, finding the field that the extracted fields and the field data table have in common, and determining it as a target field;
    calculating the similarity between the target field and each other field in the field data table other than the target field;
    selecting, from the other fields, the field with the highest similarity and replacing the target field in the first question sentence with it, to obtain a recommended second question sentence.
  16. The server of claim 15, wherein the similarity between the target field and any one other field in the field data table is calculated by the following steps:
    calculating similarity indices of the target field and the other field from the string and enumeration values of the target field and the string and enumeration values of the other field, a similarity index being a parameter that measures the degree of similarity between two fields;
    calculating the similarity between the target field and the other field from the similarity indices of the target field and the other field.
  17. The server of claim 16, wherein calculating the similarity indices of the target field and the other field comprises:
    calculating a string similarity index, a string length similarity index, an enumeration value count similarity index, and an enumeration value length similarity index of the target field and the other field;
    and wherein calculating the similarity between the target field and the other field from the similarity indices comprises:
    calculating the average or weighted average of the string similarity index, the string length similarity index, the enumeration value count similarity index, and the enumeration value length similarity index, and taking it as the similarity between the target field and the other field.
  18. The server of claim 17, wherein the string similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100009
    where s1 denotes the string similarity index, sim denotes the number of identical strings shared by the two fields, short denotes the string length of the shorter of the two fields, long denotes the string length of the longer of the two fields, and α is a hyperparameter that controls the influence of the string on the similarity;
    the string length similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100010
    where s2 denotes the string length similarity index, short denotes the string length of the shorter of the two fields, and long denotes the string length of the longer of the two fields;
    the enumeration value count similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100011
    where s3 denotes the enumeration value count similarity index, min denotes the number of enumeration values of the field having fewer enumeration values, and max denotes the number of enumeration values of the field having more enumeration values;
    the enumeration value length similarity index is calculated using the following formula:
    Figure PCTCN2021078031-appb-100012
    where s4 denotes the enumeration value length similarity index, avg_min denotes the average enumeration value length of the field whose enumeration values are shorter on average, and avg_max denotes the average enumeration value length of the field whose enumeration values are longer on average.
  19. The server of claim 15, wherein calculating the similarity between the target field and each other field in the field data table other than the target field comprises:
    retrieving all historical question sentences of the user who input the first question sentence;
    constructing a co-occurrence matrix from the historical question sentences, the co-occurrence matrix recording the number of times any two fields in the field data table appear together in the same historical question sentence of the user;
    calculating the similarity between the target field and each of the other fields from the co-occurrence matrix.
  20. The server of claim 19, wherein calculating the similarity between the target field and each of the other fields from the co-occurrence matrix comprises:
    extracting, from the co-occurrence matrix, the field vector of the target field and the field vector of each of the other fields, each element of a field vector being the number of times the corresponding field appears together with a respective field of the field data table in the same historical question sentence of the user;
    calculating the cosine similarity between the field vector of the target field and the field vector of each of the other fields, to obtain the similarity between the target field and each of the other fields.
PCT/CN2021/078031 2020-04-02 2021-02-26 Question recommendation method and apparatus based on field similarity calculation, and server WO2021196934A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010255040.6 2020-04-02
CN202010255040.6A CN111553151A (en) 2020-04-02 2020-04-02 Question recommendation method and device based on field similarity calculation and server

Publications (1)

Publication Number Publication Date
WO2021196934A1 true WO2021196934A1 (en) 2021-10-07

Family

ID=72005557

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078031 WO2021196934A1 (en) 2020-04-02 2021-02-26 Question recommendation method and apparatus based on field similarity calculation, and server

Country Status (2)

Country Link
CN (1) CN111553151A (en)
WO (1) WO2021196934A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114385623A (en) * 2021-11-30 2022-04-22 北京达佳互联信息技术有限公司 Data table acquisition method, device, apparatus, storage medium, and program product

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553151A (en) * 2020-04-02 2020-08-18 深圳壹账通智能科技有限公司 Question recommendation method and device based on field similarity calculation and server
CN112417271B (en) * 2020-11-09 2023-09-01 杭州讯酷科技有限公司 Intelligent system construction method with field recommendation
CN113673252A (en) * 2021-08-12 2021-11-19 之江实验室 Automatic join recommendation method for data table based on field semantics

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279486A (en) * 2013-04-24 2013-09-04 百度在线网络技术(北京)有限公司 Method and device for providing related searches
US20170091188A1 (en) * 2015-09-28 2017-03-30 International Business Machines Corporation Presenting answers from concept-based representation of a topic oriented pipeline
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN109147934A (en) * 2018-07-04 2019-01-04 平安科技(深圳)有限公司 Interrogation data recommendation method, device, computer equipment and storage medium
CN109509010A (en) * 2017-09-15 2019-03-22 腾讯科技(北京)有限公司 A kind of method for processing multimedia information, terminal and storage medium
CN110162615A (en) * 2019-05-29 2019-08-23 北京市律典通科技有限公司 A kind of intelligent answer method, apparatus, electronic equipment and storage medium
CN111553151A (en) * 2020-04-02 2020-08-18 深圳壹账通智能科技有限公司 Question recommendation method and device based on field similarity calculation and server

Also Published As

Publication number Publication date
CN111553151A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
WO2021196934A1 (en) Question recommendation method and apparatus based on field similarity calculation, and server
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US9754210B2 (en) User interests facilitated by a knowledge base
WO2020077896A1 (en) Method and apparatus for generating question data, computer device, and storage medium
WO2021174783A1 (en) Near-synonym pushing method and apparatus, electronic device, and medium
CN107704512B (en) Financial product recommendation method based on social data, electronic device and medium
CN111737499B (en) Data searching method based on natural language processing and related equipment
CN110096573B (en) Text parsing method and device
WO2021159738A1 (en) Data recommendation method and device based on medical field, and server and storage medium
CN111782763A (en) Information retrieval method based on voice semantics and related equipment thereof
WO2021196825A1 (en) Abstract generation method and apparatus, and electronic device and medium
US20180285448A1 (en) Producing personalized selection of applications for presentation on web-based interface
CN112559895B (en) Data processing method and device, electronic equipment and storage medium
WO2022105115A1 (en) Question and answer pair matching method and apparatus, electronic device and storage medium
CN111930895A (en) Document data retrieval method, device, equipment and storage medium based on MRC
CN112559709A (en) Knowledge graph-based question and answer method, device, terminal and storage medium
WO2022222942A1 (en) Method and apparatus for generating question and answer record, electronic device, and storage medium
CN111091883A (en) Medical text processing method and device, storage medium and equipment
CN110019474B (en) Automatic synonymy data association method and device in heterogeneous database and electronic equipment
EP3425531A1 (en) System, method, electronic device, and storage medium for identifying risk event based on social information
CN115544214B (en) Event processing method, device and computer readable storage medium
WO2023124837A1 (en) Inquiry processing method and apparatus, device, and storage medium
CN115114420A (en) Knowledge graph question-answering method, terminal equipment and storage medium
CN111324701B (en) Content supplement method, content supplement device, computer equipment and storage medium
CN113468206A (en) Data maintenance method, device, server, medium and product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21779373

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 160123)

122 Ep: pct application non-entry in european phase

Ref document number: 21779373

Country of ref document: EP

Kind code of ref document: A1