WO2020164204A1 - Text template recognition method and apparatus, and computer readable storage medium - Google Patents

Text template recognition method and apparatus, and computer readable storage medium

Info

Publication number
WO2020164204A1
WO2020164204A1 PCT/CN2019/088628 CN2019088628W WO2020164204A1 WO 2020164204 A1 WO2020164204 A1 WO 2020164204A1 CN 2019088628 W CN2019088628 W CN 2019088628W WO 2020164204 A1 WO2020164204 A1 WO 2020164204A1
Authority
WO
WIPO (PCT)
Prior art keywords
similarity
text
preset
degree
template
Prior art date
Application number
PCT/CN2019/088628
Other languages
French (fr)
Chinese (zh)
Inventor
刘轲
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020164204A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates

Definitions

  • This application relates to the field of natural language processing, and in particular to a text template recognition method, a device, and a computer-readable storage medium.
  • This application provides a text template recognition method, device, and computer-readable storage medium whose main purpose is to improve the efficiency and accuracy of text template recognition.
  • To this end, this application provides a text template recognition method, which includes: obtaining a preset text template and a matching text; calculating a first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculating a second similarity between the matching text and the preset text template according to a semantic-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matching text is a text template similar to the preset text template.
  • This application also provides a text template recognition device, which includes a memory and a processor.
  • The memory stores a text template recognition program that can run on the processor. When the program is executed by the processor, it implements the steps of the text template recognition method described above.
  • This application also provides a computer-readable storage medium on which a text template recognition program is stored. The program can be executed by one or more processors to implement the steps of the text template recognition method described above.
  • The text template recognition method, device, and computer-readable storage medium proposed in this application obtain a preset text template and a matching text; calculate the first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculate the second similarity between the matching text and the preset text template according to a semantic-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies the preset similarity condition, determine that the matching text is a text template similar to the preset text template.
  • In this way, a text template similar to the preset text template can be obtained quickly, which improves the efficiency of text template recognition. Because text similarity is calculated with the word frequency-based algorithm and/or the semantic-based algorithm, the accuracy of text template recognition is also improved.
  • FIG. 1 is a schematic flowchart of a text template recognition method provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of the internal structure of a text template recognition device provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of the modules of a text template recognition program in a text template recognition device provided by an embodiment of this application.
  • This application provides a text template recognition method.
  • Referring to FIG. 1, which is a schematic flowchart of a text template recognition method provided by the first embodiment of this application. The method can be executed by an electronic device.
  • The text template recognition method includes:
  • Step S10: Obtain a preset text template and a matching text.
  • The preset text template may be a text template pre-stored in a preset storage area (for example, in an electronic device).
  • The preset text template may be obtained by a user and stored in the preset storage area, or it may be obtained by analyzing several texts with similar wording and extracting the keywords they share.
  • The preset text template is any text template in a text template collection. The collection may consist entirely of text templates of the same type, or it may include text templates of various types.
  • Obtaining the preset text template includes: obtaining a text template collection, and obtaining a text template from that collection.
  • The matching text is the text to be judged as to whether it is a similar text template.
  • The matching text may consist of one or more sentences.
  • Step S20: Calculate the first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculate the second similarity between the matching text and the preset text template according to a semantic-based text similarity algorithm.
  • The word frequency-based text similarity algorithm calculates the similarity between two texts from the frequency with which words appear; the semantic-based text similarity algorithm calculates the similarity between two texts from the semantics of the texts.
  • Calculating the first similarity according to the word frequency-based text similarity algorithm and/or calculating the second similarity according to the semantic-based text similarity algorithm includes:
  • using the LDA document topic generation model to calculate the second similarity between the matching text and the preset text template;
  • using the vector space model to calculate the first similarity between the matching text and the preset text template.
  • Using the vector space model (VSM) to calculate the first similarity between the matching text and the preset text template includes:
  • preprocessing the matching text and the preset text template, where the preprocessing operations include, but are not limited to, word segmentation and stop-word removal (stop words being words, symbols, punctuation, and garbled characters that carry little meaning for the text content, such as "this"), to obtain the preprocessed matching text and the preprocessed preset text template;
  • determining the first keyword from the frequency of words in the preprocessed matching text;
  • determining the second keyword from the frequency of words in the preprocessed preset text template, where both the first keyword and the second keyword may contain multiple words;
  • for example, a word whose frequency in the preprocessed matching text is greater than a preset frequency is a first keyword.
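The preprocessing and keyword-selection steps can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the stop-word list, the regular-expression tokenizer (a stand-in for word segmentation that only suits space-delimited text), and the names `preprocess`, `keywords`, and `min_count` are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would load a fuller one.
STOP_WORDS = {"this", "the", "a", "an", "of", "to", "is", "with"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def keywords(tokens, min_count=1):
    """Return the words that occur more than min_count times (the preset frequency)."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n > min_count}


tokens = preprocess("Your order is shipped. Track your order with this code.")
print(keywords(tokens))
```

Chinese text would need a dedicated word segmenter in place of the regular expression; the frequency threshold logic stays the same.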
  • Inverse document frequency (IDF) is an index used to measure the weight of a keyword.
  • The first vector and the second vector are obtained according to the following formula:
  • $D = D(T_1, W_1;\ T_2, W_2;\ \ldots;\ T_n, W_n)$
  • where $T_1$ is a keyword and $W_1$ is the inverse document frequency of that keyword; $T_2$ is another keyword and $W_2$ is its inverse document frequency; $T_n$ is the nth keyword and $W_n$ is its inverse document frequency.
  • In the vector space model, the content correlation Sim(D1, D2) between two texts is usually expressed by the cosine of the angle between their vectors. Therefore, after the first vector of the matching text and the second vector of the preset text template are obtained, the cosine of the angle between the first vector and the second vector is calculated to obtain the first similarity between the matching text and the preset text template.
  • The formula for calculating the cosine can be obtained from the prior art and will not be repeated here.
  • In this way, the text is simplified to an N-dimensional vector whose components are the weights of the feature items (keywords). This simplifies the complex relationships between keywords in the text, makes the model computable, and allows the first similarity between the matching text and the preset text template to be obtained quickly.
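A minimal sketch of the vector space model step, assuming each text has already been reduced to a list of keywords. The smoothed IDF variant and the names `idf_weights`, `to_vector`, and `cosine` are assumptions made for illustration; the source only states that each keyword is weighted by its inverse document frequency and that the vectors are compared by cosine.

```python
import math


def idf_weights(documents):
    """Smoothed inverse document frequency for every word in the corpus.

    Assumption: log((1 + N) / (1 + df)) + 1, one common smoothed variant.
    """
    n_docs = len(documents)
    vocab = {w for doc in documents for w in doc}
    return {
        w: math.log((1 + n_docs) / (1 + sum(w in doc for doc in documents))) + 1
        for w in vocab
    }


def to_vector(doc_keywords, weights):
    """Map keywords to {keyword: weight}, mirroring D(T1, W1; ...; Tn, Wn)."""
    return {w: weights.get(w, 0.0) for w in doc_keywords}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


template = ["order", "shipped", "track"]   # keywords of the preset text template
matching = ["order", "shipped", "today"]   # keywords of the matching text
weights = idf_weights([template, matching])
first_similarity = cosine(to_vector(matching, weights), to_vector(template, weights))
print(first_similarity)
```

Storing the vectors as dictionaries keeps only the non-zero components, which is convenient when each text contains a small fraction of the overall vocabulary.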
  • The basic idea of the LDA (Latent Dirichlet Allocation) model is to describe a document as a probability distribution over topics, and each topic in turn as a probability distribution over terms. How to calculate the second similarity between the matching text and the preset text template with the LDA document topic generation model can be obtained from the prior art and will not be repeated here.
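The source leaves the LDA-based comparison to the prior art. One possible approach, shown here purely as an assumption, is to infer a topic distribution for each text with a trained LDA model and then compare the two distributions. The sketch below takes the topic distributions as given (hypothetical values) and uses one minus the Hellinger distance as the similarity; the measure and the name `hellinger_similarity` are illustrative choices, not taken from the source.

```python
import math


def hellinger_similarity(p, q):
    """Return 1 minus the Hellinger distance between two topic distributions.

    p and q are sequences of topic probabilities of equal length, each summing to 1.
    The result lies in [0, 1]; identical distributions give 1.0.
    """
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return 1.0 - math.sqrt(squared / 2.0)


# Hypothetical topic distributions, as an LDA model might infer them.
matching_topics = [0.70, 0.20, 0.10]
template_topics = [0.60, 0.30, 0.10]
second_similarity = hellinger_similarity(matching_topics, template_topics)
print(second_similarity)
```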
  • Step S30: When the first similarity and/or the second similarity satisfies a preset similarity condition, determine that the matching text is a text template similar to the preset text template.
  • The preset similarity condition may be set in advance.
  • That the first similarity or the second similarity satisfies the preset similarity condition includes:
  • the first similarity is greater than the first preset similarity, or the second similarity is greater than the second preset similarity.
  • The first preset similarity and the second preset similarity may be set as required, and their values may be the same or different.
  • For example, the first preset similarity is 85% and the second preset similarity is 90%; or both the first preset similarity and the second preset similarity are 90%.
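The "either threshold" condition can be expressed as a small predicate. The function name and the default thresholds (taken from the 85% and 90% example above) are illustrative.

```python
def meets_either_threshold(first_similarity, second_similarity,
                           first_threshold=0.85, second_threshold=0.90):
    """True when either similarity is strictly greater than its preset threshold."""
    return first_similarity > first_threshold or second_similarity > second_threshold


print(meets_either_threshold(0.86, 0.50))
print(meets_either_threshold(0.85, 0.90))
```

The comparison is strict because the text says "greater than"; a value exactly equal to the threshold does not satisfy the condition.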
  • That the first similarity and the second similarity satisfy the preset similarity condition includes:
  • when the third similarity is greater than the third preset similarity, it is determined that the first similarity and the second similarity satisfy the preset similarity condition.
  • Linear weighting assigns a weight value to each of the first similarity and the second similarity and then adds the weighted values to obtain the third similarity.
  • The third preset similarity may be set in advance.
  • The first similarity and the second similarity are input into a preset linear weighting formula, which outputs the third similarity between the matching text and the preset text template. The preset linear weighting formula is:
  • $sim(p, q) = \alpha \, sim_{TFIDF}(p, q) + \beta \, sim_{LDA}(p, q)$
  • where $sim_{TFIDF}(p, q)$ is the first similarity, $sim_{LDA}(p, q)$ is the second similarity, $sim(p, q)$ is the third similarity, and $\alpha$ and $\beta$ are preset weight values.
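A worked sketch of the linear weighting step, assuming the third similarity is the weighted sum of the first and second similarities and that the two weights sum to 1 (consistent with the example values 0.1 and 0.9 given later in the text). The names `combined_similarity` and `alpha` are illustrative.

```python
def combined_similarity(sim_tfidf, sim_lda, alpha):
    """Third similarity: alpha * sim_tfidf + beta * sim_lda, with beta = 1 - alpha.

    Assumption: the two weights sum to 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * sim_tfidf + beta * sim_lda


# With alpha = 0.1 and beta = 0.9: 0.1 * 0.8 + 0.9 * 0.6 = 0.62
third_similarity = combined_similarity(0.8, 0.6, alpha=0.1)
print(third_similarity > 0.5)
```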
  • The method further includes: obtaining the weight values used for linear weighting.
  • Obtaining the weight values used for linear weighting includes:
  • the first initial value is a weight value used for linear weighting;
  • the first initial value is adjusted, and the operation of calculating the third similarity according to the first initial value is performed again.
  • The clustering result is either that the matching template and the preset text template are in the same category, or that the matching template and the preset text template are not in the same category.
  • For example, the first initial value may be 0.1.
  • When the first initial value is adjusted, it may be increased by 0.1 each time. For example, if the weight being obtained is α, then at the initial assignment α is 0.1 and β is 0.9.
  • The third similarity between the matching text and the preset text template is calculated according to the preset linear weighting formula, and a clustering algorithm is used to determine whether the matching template and the preset text template are in the same category. If the third similarity is less than 50% and the clustering algorithm determines that the matching template and the preset text template are not in the same category, the third similarity is determined to be not accurate.
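The weight search described above can be sketched as a simple stepwise loop: start from the first initial value, compute the third similarity, check it against the clustering result, and increase the weight by a fixed step until the check passes. The accuracy check is passed in as a function because the surviving text does not state it unambiguously; `agrees_with_clustering` below is a hypothetical placeholder rule used only to make the example run, and the names `search_weight`, `start`, and `step` are likewise assumptions.

```python
def search_weight(sim_tfidf, sim_lda, same_category, is_accurate,
                  start=0.1, step=0.1):
    """Return the first weight alpha for which the third similarity is judged accurate.

    same_category is the clustering result for the pair (True or False).
    is_accurate(third_similarity, same_category) is the caller's accuracy test.
    Returns None when no weight in [start, 1.0] passes the test.
    """
    n_steps = int(round((1.0 - start) / step)) + 1
    for i in range(n_steps):
        alpha = round(start + i * step, 10)  # avoid floating-point drift
        third = alpha * sim_tfidf + (1.0 - alpha) * sim_lda
        if is_accurate(third, same_category):
            return alpha
    return None


def agrees_with_clustering(third, same_category, threshold=0.5):
    """Hypothetical rule: the thresholded similarity must match the clustering result."""
    return (third > threshold) == same_category


print(search_weight(0.9, 0.2, True, agrees_with_clustering))
```

Whatever rule the full specification defines for judging accuracy can be substituted for the placeholder without changing the loop.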
  • When it is determined that the matching text is a text template similar to the preset text template, the matching text can be added to the template collection of the preset text template. In this way, multiple text template collections can be obtained through this embodiment, each containing similar text templates.
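Building up template collections can be sketched as below. The first element of each collection plays the role of the preset text template. Starting a new collection for text that matches nothing is an assumption not stated in the source, and `word_overlap` is a toy stand-in for the similarity measures described earlier.

```python
def add_to_collections(text, collections, similarity, threshold=0.9):
    """Append text to the first collection whose preset template it resembles."""
    for collection in collections:
        preset_template = collection[0]
        if similarity(text, preset_template) > threshold:
            collection.append(text)
            return collection
    new_collection = [text]  # Assumption: unmatched text seeds a new collection.
    collections.append(new_collection)
    return new_collection


def word_overlap(a, b):
    """Toy similarity: Jaccard overlap of the lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


collections = [["your order 123 has shipped"]]
add_to_collections("your order 456 has shipped", collections, word_overlap, threshold=0.6)
add_to_collections("password reset requested", collections, word_overlap, threshold=0.6)
print([len(c) for c in collections])
```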
  • The text template recognition method proposed in this embodiment obtains a preset text template and a matching text; calculates the first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculates the second similarity between the matching text and the preset text template according to a semantic-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies the preset similarity condition, determines that the matching text is a text template similar to the preset text template.
  • In this way, a text template similar to the preset text template can be obtained quickly, which improves the efficiency of text template recognition. Because text similarity is calculated with the word frequency-based algorithm and/or the semantic-based algorithm, the accuracy of text template recognition is also improved.
  • the application also provides a text template recognition device.
  • Referring to FIG. 2, which is a schematic diagram of the internal structure of a text template recognition device provided by an embodiment of this application.
  • The text template recognition device 1 may be a PC (personal computer), or a terminal device such as a smartphone, a tablet computer, or a portable computer.
  • the text template recognition device 1 at least includes a memory 11, a processor 12, a network interface 13, and a communication bus 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 11 may be an internal storage unit of the text template recognition device 1 in some embodiments, such as a hard disk of the text template recognition device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the text template recognition device 1, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the text template recognition device 1.
  • the memory 11 may also include both an internal storage unit of the text template recognition apparatus 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the text template recognition device 1, such as the code of the text template recognition program 01, etc., but also to temporarily store data that has been output or will be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text template recognition program 01.
  • The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
  • the communication bus 14 is used to realize the connection and communication between these components.
  • the text template recognition device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • Optionally, the user interface may also include a standard wired interface and a wireless interface.
  • The display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, etc.
  • The display may also be called a display screen or a display unit, and is used to display the information processed in the text template recognition device 1 and to display a visualized user interface.
  • FIG. 2 only shows the text template recognition device 1 with components 11-14 and the text template recognition program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the text template recognition device 1, which may include fewer or more components than shown, a combination of some components, or a different arrangement of components.
  • A text template recognition program 01 is stored in the memory 11. When the processor 12 executes the text template recognition program 01 stored in the memory 11, the following steps are implemented:
  • The preset text template may be a text template pre-stored in a preset storage area (for example, in an electronic device).
  • The preset text template may be obtained by a user and stored in the preset storage area, or it may be obtained by analyzing several texts with similar wording and extracting the keywords they share.
  • The preset text template is any text template in a text template collection. The collection may consist entirely of text templates of the same type, or it may include text templates of various types.
  • Obtaining the preset text template includes: obtaining a text template collection, and obtaining a text template from that collection.
  • The matching text is the text to be judged as to whether it is a similar text template.
  • The matching text may consist of one or more sentences.
  • The word frequency-based text similarity algorithm calculates the similarity between two texts from the frequency with which words appear; the semantic-based text similarity algorithm calculates the similarity between two texts from the semantics of the texts.
  • Calculating the first similarity between the matching text and the preset text template according to the word frequency-based text similarity algorithm and calculating the second similarity between the matching text and the preset text template according to the semantic-based text similarity algorithm includes:
  • using the LDA document topic generation model to calculate the second similarity between the matching text and the preset text template;
  • using the vector space model to calculate the first similarity between the matching text and the preset text template.
  • Using the vector space model (VSM) to calculate the first similarity between the matching text and the preset text template includes:
  • preprocessing the matching text and the preset text template, where the preprocessing operations include, but are not limited to, word segmentation and stop-word removal (stop words being words, symbols, punctuation, and garbled characters that carry little meaning for the text content, such as "this"), to obtain the preprocessed matching text and the preprocessed preset text template;
  • determining the first keyword from the frequency of words in the preprocessed matching text;
  • determining the second keyword from the frequency of words in the preprocessed preset text template, where both the first keyword and the second keyword may contain multiple words;
  • for example, a word whose frequency in the preprocessed matching text is greater than a preset frequency is a first keyword.
  • Inverse document frequency (IDF) is an index used to measure the weight of a keyword.
  • The first vector and the second vector are obtained according to the following formula:
  • $D = D(T_1, W_1;\ T_2, W_2;\ \ldots;\ T_n, W_n)$
  • where $T_1$ is a keyword and $W_1$ is the inverse document frequency of that keyword; $T_2$ is another keyword and $W_2$ is its inverse document frequency; $T_n$ is the nth keyword and $W_n$ is its inverse document frequency.
  • In the vector space model, the content correlation Sim(D1, D2) between two texts is usually expressed by the cosine of the angle between their vectors. Therefore, after the first vector of the matching text and the second vector of the preset text template are obtained, the cosine of the angle between the first vector and the second vector is calculated to obtain the first similarity between the matching text and the preset text template.
  • The formula for calculating the cosine can be obtained from the prior art and will not be repeated here.
  • In this way, the text is simplified to an N-dimensional vector whose components are the weights of the feature items (keywords). This simplifies the complex relationships between keywords in the text, makes the model computable, and allows the first similarity between the matching text and the preset text template to be obtained quickly.
  • The basic idea of the LDA (Latent Dirichlet Allocation) model is to describe a document as a probability distribution over topics, and each topic in turn as a probability distribution over terms. How to calculate the second similarity between the matching text and the preset text template with the LDA document topic generation model can be obtained from the prior art and will not be repeated here.
  • When the first similarity and/or the second similarity satisfies a preset similarity condition, it is determined that the matching text is a text template similar to the preset text template.
  • The preset similarity condition may be set in advance.
  • That the first similarity or the second similarity satisfies the preset similarity condition includes:
  • the first similarity is greater than the first preset similarity, or the second similarity is greater than the second preset similarity.
  • The first preset similarity and the second preset similarity may be set as required, and their values may be the same or different.
  • For example, the first preset similarity is 85% and the second preset similarity is 90%; or both the first preset similarity and the second preset similarity are 90%.
  • That the first similarity and the second similarity satisfy the preset similarity condition includes:
  • when the third similarity is greater than the third preset similarity, it is determined that the first similarity and the second similarity satisfy the preset similarity condition.
  • Linear weighting assigns a weight value to each of the first similarity and the second similarity and then adds the weighted values to obtain the third similarity.
  • The third preset similarity may be set in advance.
  • The first similarity and the second similarity are input into a preset linear weighting formula, which outputs the third similarity between the matching text and the preset text template. The preset linear weighting formula is:
  • $sim(p, q) = \alpha \, sim_{TFIDF}(p, q) + \beta \, sim_{LDA}(p, q)$
  • where $sim_{TFIDF}(p, q)$ is the first similarity, $sim_{LDA}(p, q)$ is the second similarity, $sim(p, q)$ is the third similarity, and $\alpha$ and $\beta$ are preset weight values.
  • When executed by the processor, the text template recognition program further implements the following step: obtaining the weight values used for linear weighting.
  • Obtaining the weight values used for linear weighting includes:
  • the first initial value is a weight value used for linear weighting;
  • the first initial value is adjusted, and the operation of calculating the third similarity according to the first initial value is performed again.
  • The clustering result is either that the matching template and the preset text template are in the same category, or that the matching template and the preset text template are not in the same category.
  • For example, the first initial value may be 0.1.
  • When the first initial value is adjusted, it may be increased by 0.1 each time. For example, if the weight being obtained is α, then at the initial assignment α is 0.1 and β is 0.9.
  • The third similarity between the matching text and the preset text template is calculated according to the preset linear weighting formula, and a clustering algorithm is used to determine whether the matching template and the preset text template are in the same category. If the third similarity is less than 50% and the clustering algorithm determines that the matching template and the preset text template are not in the same category, the third similarity is determined to be not accurate.
  • When it is determined that the matching text is a text template similar to the preset text template, the matching text can be added to the template collection of the preset text template. In this way, multiple text template collections can be obtained through this embodiment, each containing similar text templates.
  • The text template recognition device proposed in this embodiment obtains a preset text template and a matching text; calculates the first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculates the second similarity between the matching text and the preset text template according to a semantic-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies the preset similarity condition, determines that the matching text is a text template similar to the preset text template.
  • In this way, a text template similar to the preset text template can be obtained quickly, which improves the efficiency of text template recognition. Because text similarity is calculated with the word frequency-based algorithm and/or the semantic-based algorithm, the accuracy of text template recognition is also improved.
  • The text template recognition program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to carry out this application.
  • A module, as referred to in this application, is a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the text template recognition program in the text template recognition device.
  • Referring to FIG. 3, which is a schematic diagram of the program modules of the text template recognition program 01 in an embodiment of the text template recognition device of this application.
  • The text template recognition program can be divided into an acquisition module 10, a calculation module 20, and a determining module 30. Exemplarily:
  • the acquisition module 10 is used to obtain a preset text template and a matching text;
  • the calculation module 20 is used to calculate the first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculate the second similarity between the matching text and the preset text template according to a semantic-based text similarity algorithm;
  • the determining module 30 is used to determine that the matching text is a text template similar to the preset text template when the first similarity and/or the second similarity satisfies a preset similarity condition.
  • An embodiment of this application also proposes a computer-readable storage medium on which a text template recognition program is stored. The program can be executed by one or more processors to implement the following operations:
  • obtaining a preset text template and a matching text; calculating the first similarity and/or the second similarity between the matching text and the preset text template; and, when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matching text is a text template similar to the preset text template.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed in the present application is a text template recognition method, the method comprising: acquiring a preset text template and a matching text; on the basis of a word frequency-based text similarity algorithm, calculating a first degree of similarity between the matching text and the preset text template; and/or, on the basis of a semantics-based text similarity algorithm, calculating a second degree of similarity between the matching text and the preset text template; and, when the first degree of similarity and/or the second degree of similarity meets a preset similarity condition, determining that the matching text is a text template similar to the preset text template. Also provided in the present application are a text template recognition apparatus and a computer readable storage medium. The present application can improve the efficiency and accuracy of text template recognition.

Description

文本模板识别方法、装置及计算机可读存储介质Text template recognition method, device and computer readable storage medium
本申请要求于2019年2月11日提交中国专利局,申请号为201910109887.0、发明名称为“文本模板识别方法、装置及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on February 11, 2019. The application number is 201910109887.0 and the invention title is "Text template recognition method, device and computer readable storage medium". The entire content of the patent application is by reference Incorporated in this application.
技术领域Technical field
本申请涉及自然语言处理技术领域,尤其涉及一种文本模板识别方法、装置及计算机可读存储介质。This application relates to the field of natural language processing technology, and in particular to a text template recognition method, device, and computer-readable storage medium.
背景技术Background technique
随着互联网技术的发展,各行各业的人们都能够通过网络平台自由发布和下载信息,这使得网络上的信息越来越多,大数据分析即对网络上海量的数据进行分析进而提取所需的信息。在进行大数据分析时有时需要用到文本模板,即包含某些特定文字的文本信息。通常,相同的文本信息或类似的文本信息可以对应一个文本模板。现有技术中,获取文本模板的方法通常是由工作人员从各种信息中进行提取,然而这种方法耗时耗力,工作人员需要花费很长的时间去识别进而获取文本模块。With the development of Internet technology, people in all walks of life can freely publish and download information through online platforms, which makes the information on the Internet more and more. Big data analysis is to analyze the amount of data on the Internet and extract what they need Information. Sometimes you need to use a text template when analyzing big data, that is, text information that contains some specific words. Generally, the same text information or similar text information can correspond to a text template. In the prior art, the method for obtaining the text template is usually by the worker extracting various information, but this method is time-consuming and labor-intensive, and the worker needs a long time to identify and then obtain the text module.
Summary
This application provides a text template recognition method, device and computer readable storage medium, the main purpose of which is to improve the efficiency and accuracy of text template recognition.
To achieve the above objective, this application provides a text template recognition method, which includes:
obtaining a preset text template and a matching text;
calculating a first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm; and/or
calculating a second similarity between the matching text and the preset text template according to a semantics-based text similarity algorithm;
when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matching text is a text template similar to the preset text template.
In addition, to achieve the above objective, this application also provides a text template recognition device, which includes a memory and a processor. The memory stores a text template recognition program that can run on the processor, and the following steps are implemented when the text template recognition program is executed by the processor:
obtaining a preset text template and a matching text;
calculating a first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm; and/or
calculating a second similarity between the matching text and the preset text template according to a semantics-based text similarity algorithm;
when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matching text is a text template similar to the preset text template.
In addition, to achieve the above objective, this application also provides a computer-readable storage medium on which a text template recognition program is stored. The text template recognition program can be executed by one or more processors to implement the steps of the text template recognition method described above.
The text template recognition method, text template recognition device, and computer-readable storage medium proposed in this application obtain a preset text template and a matching text; calculate a first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm; and/or calculate a second similarity between the matching text and the preset text template according to a semantics-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies a preset similarity condition, determine that the matching text is a text template similar to the preset text template. Text templates similar to the preset text template can thus be obtained quickly without staff judging each text manually, which improves the efficiency of text template recognition. Moreover, calculating text similarity with the word frequency-based algorithm and/or the semantics-based algorithm can improve the accuracy of text template recognition.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a text template recognition method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of a text template recognition device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of a text template recognition program in a text template recognition device provided by an embodiment of this application.
The realization of the purpose, the functional characteristics, and the advantages of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides a text template recognition method. FIG. 1 is a schematic flowchart of the text template recognition method provided by the first embodiment of this application. The method can be executed by an electronic device.
In this embodiment, the text template recognition method includes:
Step S10: Obtain a preset text template and a matching text.
The preset text template may be a text template stored in advance in a preset storage area (for example, on the electronic device). The preset text template may be obtained by a user and saved in the preset storage area, or it may be obtained by analyzing several texts containing similar words and extracting the keywords those texts have in common.
In a possible embodiment, the preset text template is any one text template in a text template set. The set may contain only text templates of the same type, or it may contain text templates of various types. Obtaining the preset text template includes: obtaining the text template set; and obtaining one text template from the text template set.
The matching text is the text to be judged as to whether it is a similar text template. The matching text may consist of one or more sentences.
Step S20: Calculate a first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm, and/or calculate a second similarity between the matching text and the preset text template according to a semantics-based text similarity algorithm.
The word frequency-based text similarity algorithm calculates the similarity between two texts from the frequency with which words occur; the semantics-based text similarity algorithm calculates the similarity between two texts from the semantics of their words.
Specific word frequency-based and semantics-based text similarity algorithms can be obtained from the prior art and are not described in detail here.
Optionally, in another embodiment of this application, calculating the first similarity between the matching text and the preset text template according to the word frequency-based text similarity algorithm and/or calculating the second similarity between the matching text and the preset text template according to the semantics-based text similarity algorithm includes:
calculating the first similarity between the matching text and the preset text template using a vector space model;
calculating the second similarity between the matching text and the preset text template using an LDA document topic generation model.
In this embodiment, a vector space model is used to calculate the first similarity between the matching text and the preset text template. Calculating the first similarity with the vector space model (Vector Space Model, VSM) includes:
performing preprocessing operations on the matching text and the preset text template to obtain a preprocessed matching text and a preprocessed preset text template, the preprocessing operations including but not limited to word segmentation and stop-word removal (stop words include words, symbols, punctuation, garbled characters and the like that contribute little to the meaning of the text, such as "这", "的" and "呀");
determining first keywords from the frequency of words in the preprocessed matching text, and determining second keywords from the frequency of words in the preprocessed preset text template, where both the first keywords and the second keywords may contain multiple words;
for example, words whose frequency in the preprocessed matching text is greater than a preset frequency are determined to be first keywords;
after the first keywords and the second keywords are determined, calculating the inverse document frequency of the first keywords and the inverse document frequency of the second keywords, and generating a first vector representing the matching text and a second vector representing the preset text template;
where the inverse document frequency (IDF) is an index used to measure the weight of a keyword.
The inverse document frequency of a keyword can be calculated with the formula $\mathrm{IDF} = \log(D / D_w)$, where $D$ is the total number of texts in the sample database and $D_w$ is the number of texts in which the keyword appears.
In this embodiment, the first vector and the second vector are obtained according to the following formula:
$D = D(T_1, W_1; T_2, W_2; \ldots; T_n, W_n)$
where $T_1$ is a keyword and $W_1$ is the inverse document frequency of that keyword; $T_2$ is another keyword and $W_2$ is the inverse document frequency of that keyword; and so on, with $T_n$ the n-th keyword and $W_n$ its inverse document frequency.
In the vector space model, the content relevance $\mathrm{Sim}(D_1, D_2)$ between two texts is commonly expressed as the cosine of the angle between their vectors. Therefore, after the first vector of the matching text and the second vector of the preset text template are obtained, the cosine of the first vector and the second vector is calculated, which gives the first similarity between the matching text and the preset text template. The formula for calculating the cosine can be obtained from the prior art and is not described in detail here.
In this embodiment, a text is represented as an N-dimensional vector whose components are the weights of its feature items (keywords). This simplifies the complex relationships between keywords in the text and makes the model computable, so the first similarity between the matching text and the preset text template can be obtained quickly.
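The vector space model steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the whitespace tokenizer, the English `STOP_WORDS` list, the `min_count` threshold, and all function names are hypothetical, and a real system for Chinese text would need a proper word segmenter.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the text only gives examples of such words.
STOP_WORDS = {"the", "a", "of", "is"}


def preprocess(text):
    """Lower-case, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def keywords(tokens, min_count=0):
    """Keep words that occur more than min_count times (assumed preset frequency)."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count > min_count}


def idf(word, corpus):
    """IDF = log(D / D_w); corpus is a list of token lists (the sample database)."""
    d_w = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / d_w) if d_w else 0.0


def weight_vector(words, vocabulary, corpus):
    """One component per vocabulary word: its IDF if it is a keyword, else 0."""
    return [idf(w, corpus) if w in words else 0.0 for w in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def first_similarity(matching_text, template, corpus):
    """First similarity: cosine of the IDF-weighted keyword vectors of both texts."""
    kw_match = keywords(preprocess(matching_text))
    kw_template = keywords(preprocess(template))
    vocabulary = sorted(kw_match | kw_template)
    return cosine(weight_vector(kw_match, vocabulary, corpus),
                  weight_vector(kw_template, vocabulary, corpus))
```

Both vectors are built over the union of the two keyword sets so that their components line up; a keyword missing from one text simply contributes a zero component.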
In this embodiment, the basic idea of the LDA (Latent Dirichlet Allocation) model is to describe a document as a probability distribution over topics and, further, to describe each topic as a probability distribution over terms. How to calculate the second similarity between the matching text and the preset text template with the LDA document topic generation model can be obtained from the prior art and is not described in detail here.
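Since the text leaves the LDA-based calculation to the prior art, the following is only one possible sketch. It assumes the topic distributions of the two texts have already been inferred by some LDA implementation, and it scores them with one minus the Hellinger distance; that choice of measure and the function name are assumptions, not something the application specifies.

```python
import math


def hellinger_similarity(theta_p, theta_q):
    """Return 1 - Hellinger distance between two topic distributions.

    theta_p and theta_q are assumed to be probability vectors over the same
    topics (non-negative, each summing to 1), e.g. as inferred by an LDA model.
    """
    if len(theta_p) != len(theta_q):
        raise ValueError("topic distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                  for a, b in zip(theta_p, theta_q))
    return 1.0 - math.sqrt(squared / 2.0)
```

The result lies in [0, 1]: identical topic mixtures score 1 and mixtures with no shared topic score 0, which makes it directly comparable with a cosine-based first similarity.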
Step S30: When the first similarity and/or the second similarity satisfies a preset similarity condition, determine that the matching text is a text template similar to the preset text template.
The preset similarity condition may be set in advance.
Optionally, in another embodiment of this application, the first similarity or the second similarity satisfying the preset similarity condition includes:
the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
The first preset similarity and the second preset similarity may be set in advance as required, and their values may be the same or different. For example, the first preset similarity is 85% and the second preset similarity is 90%; or both the first preset similarity and the second preset similarity are 90%.
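The "either threshold" condition can be written as a one-line check. In this sketch the default values 0.85 and 0.90 come from the example in the text, while the function and parameter names are hypothetical.

```python
def meets_either_threshold(first_sim, second_sim,
                           first_preset=0.85, second_preset=0.90):
    """True if either similarity is strictly greater than its preset value."""
    return first_sim > first_preset or second_sim > second_preset
```

The comparison is strict, following the wording "greater than", so a score exactly equal to its preset value does not satisfy the condition.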
Optionally, in another embodiment of this application, the first similarity and the second similarity satisfying the preset similarity condition includes:
performing linear weighting on the first similarity and the second similarity to obtain a third similarity between the matching text and the preset text template;
determining whether the third similarity is greater than a third preset similarity;
if the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity satisfy the preset similarity condition.
Linear weighting means assigning a weight to each of the first similarity and the second similarity and then adding the weighted values to obtain the third similarity.
The third preset similarity may be set in advance.
Optionally, in another embodiment of this application, performing linear weighting on the first similarity and the second similarity to obtain the third similarity between the matching text and the preset text template includes:
inputting the first similarity and the second similarity into a preset linear weighting formula and outputting the third similarity between the matching text and the preset text template, the preset linear weighting formula being:
$\mathrm{sim}(p, q) = \alpha\,\mathrm{sim}_{LDA}(p, q) + \beta\,\mathrm{sim}_{TFIDF}(p, q)$,
where $p$ and $q$ are the matching text and the preset text template respectively, $\mathrm{sim}_{TFIDF}(p, q)$ is the first similarity, $\mathrm{sim}_{LDA}(p, q)$ is the second similarity, $\mathrm{sim}(p, q)$ is the third similarity, and $\alpha$ and $\beta$ are preset weight values.
In this embodiment, $0 \le \alpha \le 1$, $0 \le \beta \le 1$, and $\alpha + \beta = 1$.
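As a small illustration of the linear weighting formula, the sketch below derives β from α using the constraint that the two weights sum to 1. The function names and the `third_preset` parameter name are hypothetical.

```python
def third_similarity(sim_lda, sim_tfidf, alpha):
    """sim = alpha * sim_LDA + beta * sim_TFIDF, with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * sim_lda + beta * sim_tfidf


def satisfies_weighted_condition(sim_lda, sim_tfidf, alpha, third_preset):
    """True if the weighted third similarity exceeds the third preset similarity."""
    return third_similarity(sim_lda, sim_tfidf, alpha) > third_preset
```

For instance, with a second (LDA) similarity of 0.8, a first (TF-IDF) similarity of 0.6 and α = 0.5, the third similarity is about 0.7, which would pass a third preset similarity of 0.65 but not one of 0.75.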
Optionally, in another embodiment of this application, the method further includes obtaining a weight value for the linear weighting. Obtaining the weight value for the linear weighting includes:
assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
determining, with a preset clustering algorithm, whether the matching text and the preset text template belong to the same category, and obtaining a clustering result;
determining from the clustering result whether the third similarity calculated according to the first initial value is accurate;
if the third similarity calculated according to the first initial value is determined to be accurate, taking the first initial value as the weight value for the linear weighting;
if the third similarity calculated according to the first initial value is determined to be inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
The above steps are used to obtain the value of α or β.
The clustering result is either that the matching text and the preset text template belong to the same category, or that they do not belong to the same category.
The first initial value may be 0.1, and each adjustment may increase it by 0.1. For example, if the weight being obtained is α, α is set to 0.1 at the first assignment, so β is 0.9. The third similarity between the matching text and the preset text template is calculated according to the preset linear weighting formula, and the clustering algorithm is used to determine whether the matching text and the preset text template belong to the same category. If the third similarity is less than 50% while the clustering algorithm determines that the matching text and the preset text template do not belong to the same category, the third similarity calculated according to the first initial value is determined to be inaccurate. Then let α = α + 0.1, so α is 0.2 and β is 0.8; the third similarity is calculated again according to the preset linear weighting formula, and the clustering algorithm again determines whether the matching text and the preset text template belong to the same category. If the result is still inaccurate, let α = α + 0.1, so α is 0.3 and β is 0.7, and calculate again; and so on, until the optimal values of α and β are found.
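The stepwise search for the weight can be sketched as a simple loop over candidate values of α. This is an illustrative reading only: the accuracy check is passed in as a caller-supplied predicate (`is_accurate`, a hypothetical name), because the way a score is compared with the clustering result is a design choice, and the clustering result itself is assumed to have been computed elsewhere.

```python
def search_alpha(sim_lda, sim_tfidf, same_cluster, is_accurate,
                 start=0.1, step=0.1):
    """Try alpha = start, start + step, ... up to 1.0.

    Returns the first alpha whose weighted third similarity is judged
    accurate against the clustering result, or None if no candidate is.
    """
    n_steps = int(round((1.0 - start) / step))
    for i in range(n_steps + 1):
        alpha = round(start + i * step, 10)  # avoid floating-point drift
        third = alpha * sim_lda + (1.0 - alpha) * sim_tfidf
        if is_accurate(third, same_cluster):
            return alpha
    return None
```

One possible predicate, offered purely as an assumption, treats a score as accurate when it falls above a 0.5 cut-off exactly when clustering put the two texts in the same category: `lambda third, same: (third > 0.5) == same`. With that predicate, an LDA similarity of 0.9, a TF-IDF similarity of 0.2 and a same-category clustering result, the search stops at α = 0.5.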
In this embodiment, when the matching text is determined to be a text template similar to the preset text template, the matching text can be added to the template set of the preset text template. In this way, multiple text template sets can be obtained, each containing similar text templates.
The text template recognition method proposed in this embodiment obtains a preset text template and a matching text; calculates a first similarity between the matching text and the preset text template according to a word frequency-based text similarity algorithm; and/or calculates a second similarity between the matching text and the preset text template according to a semantics-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies a preset similarity condition, determines that the matching text is a text template similar to the preset text template. Text templates similar to the preset text template can thus be obtained quickly without staff judging each text manually, which improves the efficiency of text template recognition. Moreover, calculating text similarity with the word frequency-based algorithm and/or the semantics-based algorithm can improve the accuracy of text template recognition.
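Growing the template sets can be sketched as follows. The representation of template sets as a list of lists, the `is_similar` callback (which stands in for whichever similarity condition is used), and the function name are all assumptions made for illustration.

```python
def add_to_template_set(matching_text, template_sets, is_similar):
    """Append matching_text to the first set containing a similar template.

    template_sets is a list of lists of template strings; is_similar is a
    caller-supplied function (text, template) -> bool. Returns the index of
    the set that received the text, or None if no set matched.
    """
    for index, templates in enumerate(template_sets):
        if any(is_similar(matching_text, template) for template in templates):
            templates.append(matching_text)
            return index
    return None
```

Because the preset text template may be any member of a set, the sketch compares the matching text with every template in a set rather than with a single representative.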
This application also provides a text template recognition device. FIG. 2 is a schematic diagram of the internal structure of a text template recognition device provided by an embodiment of this application.
In this embodiment, the text template recognition device 1 may be a PC (Personal Computer), or a terminal device such as a smartphone, a tablet computer, or a portable computer. The text template recognition device 1 includes at least a memory 11, a processor 12, a network interface 13, and a communication bus 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory 11 may be an internal storage unit of the text template recognition device 1, such as its hard disk. In other embodiments the memory 11 may be an external storage device of the text template recognition device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the text template recognition device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the text template recognition device 1. The memory 11 can be used not only to store the application software installed on the text template recognition device 1 and various data, such as the code of the text template recognition program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text template recognition program 01.
The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
The communication bus 14 is used to realize connection and communication between these components.
Optionally, the text template recognition device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be called a display screen or display unit, and is used to display the information processed in the text template recognition device 1 and to display a visualized user interface.
FIG. 2 shows only the text template recognition device 1 with the components 11-14 and the text template recognition program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the text template recognition device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
在图2所示的文本模板识别装置1实施例中,存储器11中存储有文本模板识别程序01;处理器12执行存储器11中存储的文本模板识别程序01时实现如下步骤:In the embodiment of the text template recognition device 1 shown in FIG. 2, a text template recognition program 01 is stored in the memory 11; the processor 12 implements the following steps when executing the text template recognition program 01 stored in the memory 11:
获取预设文本模板和匹配文本。Get the preset text template and matching text.
所述预设文本模板可以是预先存储在预设存储区的(例如存储在电子设备的)文本模板。该预设文本模板可由用户获取并保存在预设存储区,或者,该预设文本模板由通过分析若干类似词语的文本,并提取该文本中相似的关键词,得到预设文本模板。The preset text template may be a text template pre-stored in a preset storage area (for example, stored in an electronic device). The preset text template can be obtained by the user and stored in a preset storage area, or the preset text template is obtained by analyzing several texts of similar words and extracting similar keywords in the text.
一种可能的实施例中,预设文本模板为一个文本模板集合中的任意一个文本模板,该文本集合中都为同一类文本模板,或者该文本集合中包括各种 不同类的文本模板。所述获取预设文本模板包括:获取文本模板集合;获取所述文本模板集合中的一文本模板。In a possible embodiment, the preset text template is any text template in a text template collection, and the text collection is all text templates of the same type, or the text collection includes various types of text templates. The obtaining of the preset text template includes: obtaining a text template collection; obtaining a text template in the text template collection.
所述匹配文本是需要进行判断是否为相似文本模板的文本。该匹配文本可以由一个或多个语句组成。The matched text is text that needs to be judged whether it is a similar text template. The matched text can consist of one or more sentences.
根据基于词频的文本相似度算法计算所述匹配文本与所述预设文本模板的第一相似度和/或根据基于语义的文本相似度算法计算所述匹配文本与所述预设文本模板的第二相似度。Calculate the first similarity between the matching text and the preset text template according to the word frequency-based text similarity algorithm and/or calculate the first similarity between the matching text and the preset text template according to the semantic-based text similarity algorithm Two similarity.
所述基于词频的文本相似度算法通过词的出现频率来计算两个文本之间相似度;所述基于语义的文本相似度算法通过此的语义来计算两个文本之间的相似度。The word frequency-based text similarity algorithm calculates the similarity between two texts by the appearance frequency of words; the semantic-based text similarity algorithm calculates the similarity between two texts by the semantics of this.
具体的基于词频的文本相似度算法以及所述基于语义的文本相似度算法可以从现有技术中获取,此处不再赘述。The specific word frequency-based text similarity algorithm and the semantic-based text similarity algorithm can be obtained from the prior art, and will not be repeated here.
可选的,在发明另一实施例中,所述根据基于词频的文本相似度算法计算所述匹配文本与所述预设文本模板的第一相似度和根据基于语义的文本相似度算法计算所述匹配文本与所述预设文本模板的第二相似度包括:Optionally, in another embodiment of the invention, the calculation of the first similarity between the matched text and the preset text template according to the word frequency-based text similarity algorithm and the calculation of the first similarity between the matching text and the preset text template according to the semantic-based text similarity algorithm The second degree of similarity between the matched text and the preset text template includes:
利用向量空间模型计算所述匹配文本与所述预设文本模板的第一相似度;Calculating the first similarity between the matched text and the preset text template by using a vector space model;
利用LDA文档主题生成模型计算所述匹配文本与所述预设文本模型的第二相似度。The LDA document topic generation model is used to calculate the second similarity between the matching text and the preset text model.
在本实施例中利用向量空间模型计算匹配文本与预设文本模板的第一相似度。利用所述向量空间模型(Vector Space Model,SVM)计算匹配文本与预设文本模板的第一相似度包括:In this embodiment, the vector space model is used to calculate the first similarity between the matched text and the preset text template. Using the Vector Space Model (SVM) to calculate the first similarity between the matched text and the preset text template includes:
对匹配文本和预设文本模板进行预处理操作,所述预处理操作包括但不限于分词、去停用词(包括对文本内容意义不大的词、符号、标点、乱码等,如“这”“的”“呀”等),得到预处理后的匹配文本和预处理后的预设文本模板;Perform preprocessing operations on the matched text and preset text templates. The preprocessing operations include, but are not limited to, word segmentation and stop-word removal (including words, symbols, punctuation, garbled characters that have little meaning to the text content, such as "this" "的", "呀", etc.) to obtain the preprocessed matching text and the preprocessed preset text template;
从预处理后的匹配文本中词语的频率确定第一关键词,以及从预处理后的预设文本模板中词语的频率确定第二关键词,其中,第一关键词和第二关键词都可包含多个词语;The first keyword is determined from the frequency of words in the preprocessed matching text, and the second keyword is determined from the frequency of words in the preprocessed preset text template, where both the first keyword and the second keyword can be used Contains multiple words;
例如,确定预处理后的匹配文本中出现频率大于预设频率的词语为第一关键词。For example, it is determined that a word with a frequency greater than a preset frequency in the preprocessed matched text is the first keyword.
在确定第一关键词和第二关键词之后,计算第一关键词的逆向文本频率,以 及第二关键词的逆向文本频率,并生成表示匹配文本的第一向量和表示预设文本模板的第二向量;After determining the first keyword and the second keyword, calculate the reverse text frequency of the first keyword and the reverse text frequency of the second keyword, and generate the first vector representing the matched text and the first vector representing the preset text template Two vectors
其中,逆向文本频率(inverse document frequency,IDF)是用于衡量关键词权重的指数。Among them, inverse document frequency (IDF) is an index used to measure the weight of keywords.
某一关键词的逆向文本频率可以根据其公式IDF=log(D/D w)进行计算,其中,D为样本数据库中文本的总数量,D w为关键词出现过的文本的数量。 The reverse text frequency of a certain keyword can be calculated according to its formula IDF=log(D/D w ), where D is the total number of texts in the sample database, and D w is the number of texts where the keyword has appeared.
本实施例中,根据以下公式得到第一向量和第二向量:In this embodiment, the first vector and the second vector are obtained according to the following formula:
D=D(T1,W1;T2,W2;…,Tn,Wn)D=D(T1, W1; T2, W2;..., Tn, Wn)
其中,T1为一个关键词,W1为该关键词的逆向文本频率;T2为另一个关键词,W2为该关键词的逆向文本频率;以此类推,Tn为第n个关键词,Wn为该关键词的逆向文本频率。Among them, T1 is a keyword, W1 is the reverse text frequency of the keyword; T2 is another keyword, W2 is the reverse text frequency of the keyword; and so on, Tn is the nth keyword, Wn is the keyword The reverse text frequency of the keyword.
In the vector space model, the content relevance $\mathrm{Sim}(D_1, D_2)$ between two texts is commonly expressed as the cosine of the angle between their vectors. Therefore, after the first vector of the matched text and the second vector of the preset text template are obtained, the cosine of the first vector and the second vector is calculated, giving the first similarity between the matched text and the preset text template. The formula for the cosine is available from the prior art and is not repeated here.
In this embodiment, each text is reduced to an N-dimensional vector whose components are the weights of its feature items (keywords). This simplifies the complex relationships among the keywords in the text and makes the model computable, so that the first similarity between the matched text and the preset text template can be obtained quickly.
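The cosine step can be illustrated with a short sketch. It assumes each text vector is stored as a dictionary mapping a keyword to its weight, so that keywords absent from one text count as zero in that text's vector; the function name `cosine_similarity` and the sample weights are hypothetical.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two sparse keyword -> weight vectors."""
    dot = sum(w * vector_b.get(k, 0.0) for k, w in vector_a.items())
    norm_a = math.sqrt(sum(w * w for w in vector_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vector_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty or all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


first_vector = {"order": 1.2, "shipped": 0.7}      # matched text
second_vector = {"order": 1.2, "delivered": 0.9}   # preset text template
first_similarity = cosine_similarity(first_vector, second_vector)
print(round(first_similarity, 3))
```

Because the weights are non-negative, the result falls between 0 (no shared keywords) and 1 (identical direction), which makes it easy to compare against a percentage threshold.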
In this embodiment, the basic idea of the LDA (Latent Dirichlet Allocation) model is to describe a document as a probability distribution over topics and, further, to describe each topic as a probability distribution over terms. How to calculate the second similarity between the matched text and the preset text template with the LDA document topic generation model is available from the prior art and is not repeated here.
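The source leaves the calculation of the second similarity to the prior art. One possible approach, shown here purely as an assumption, is to compare the topic distributions that an LDA model infers for the two texts. The sketch below takes those distributions as given (it does not train or run LDA) and scores them as one minus the Jensen-Shannon divergence in base 2, which lies between 0 and 1. The function names and the sample distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topic_similarity(topics_a, topics_b):
    """Similarity of two topic distributions: 1 - Jensen-Shannon divergence.

    topics_a and topics_b are probability distributions over the same topics,
    assumed to come from an LDA model (not computed here).
    """
    mixture = [(a + b) / 2 for a, b in zip(topics_a, topics_b)]
    js = 0.5 * kl_divergence(topics_a, mixture) + 0.5 * kl_divergence(topics_b, mixture)
    return 1.0 - js


matched_topics = [0.7, 0.2, 0.1]    # hypothetical LDA output for the matched text
template_topics = [0.6, 0.3, 0.1]   # hypothetical LDA output for the preset template
second_similarity = topic_similarity(matched_topics, template_topics)
print(round(second_similarity, 3))
```

Cosine similarity or Hellinger distance over the same distributions would be equally reasonable choices; the source does not prescribe one.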
When the first similarity and/or the second similarity satisfies a preset similarity condition, the matched text is determined to be a text template similar to the preset text template.
The preset similarity condition may be set in advance.
Optionally, in another embodiment of the present application, the first similarity or the second similarity satisfying the preset similarity condition includes:
the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
The first preset similarity and the second preset similarity may be set in advance as required, and their values may be the same or different. For example, the first preset similarity is 85% and the second preset similarity is 90%; or the first preset similarity and the second preset similarity are both 90%.
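The "either threshold" condition reduces to a single comparison. The sketch below uses the example values from the text (85% and 90%) as defaults; the function name `meets_either_threshold` is hypothetical, and similarities are assumed to be expressed as fractions between 0 and 1.

```python
def meets_either_threshold(first_similarity, second_similarity,
                           first_preset=0.85, second_preset=0.90):
    """True if either similarity strictly exceeds its own preset similarity."""
    return first_similarity > first_preset or second_similarity > second_preset


print(meets_either_threshold(0.86, 0.40))  # first similarity exceeds 85%
print(meets_either_threshold(0.85, 0.90))  # neither strictly exceeds its preset
```

The comparison is strict ("greater than"), so a similarity exactly equal to its preset value does not satisfy the condition.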
Optionally, in another embodiment of the present application, the first similarity and the second similarity satisfying the preset similarity condition includes:
linearly weighting the first similarity and the second similarity to obtain a third similarity between the matched text and the preset text template;
determining whether the third similarity is greater than a third preset similarity;
if the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity satisfy the preset similarity condition.
Linear weighting means assigning a weight to each of the first similarity and the second similarity and adding the weighted values to obtain the third similarity.
The third preset similarity may be set in advance.
Optionally, in another embodiment of the present application, linearly weighting the first similarity and the second similarity to obtain the third similarity between the matched text and the preset text template includes:
inputting the first similarity and the second similarity into a preset linear weighting formula and outputting the third similarity between the matched text and the preset text template, the preset linear weighting formula being:
$\mathrm{sim}(p,q) = \alpha\,\mathrm{sim}_{\mathrm{LDA}}(p,q) + \beta\,\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$,
where $p$ and $q$ are the matched text and the preset text template respectively, $\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$ is the first similarity, $\mathrm{sim}_{\mathrm{LDA}}(p,q)$ is the second similarity, $\mathrm{sim}(p,q)$ is the third similarity, and $\alpha$ and $\beta$ are preset weight values.
In this embodiment, $0 \le \alpha \le 1$, $0 \le \beta \le 1$, and $\alpha + \beta = 1$.
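A minimal sketch of the linear weighting follows. Since α and β sum to 1, only α is passed in and β is derived from it; the function names and the sample values are hypothetical.

```python
def third_similarity(sim_tfidf, sim_lda, alpha):
    """sim(p, q) = alpha * sim_LDA(p, q) + beta * sim_TFIDF(p, q), with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * sim_lda + beta * sim_tfidf


def satisfies_weighted_condition(sim_tfidf, sim_lda, alpha, third_preset):
    """True if the third similarity strictly exceeds the third preset similarity."""
    return third_similarity(sim_tfidf, sim_lda, alpha) > third_preset


# With alpha = 0.25: 0.25 * 0.6 + 0.75 * 0.8 = 0.75
print(third_similarity(sim_tfidf=0.8, sim_lda=0.6, alpha=0.25))
print(satisfies_weighted_condition(0.8, 0.6, alpha=0.25, third_preset=0.7))
```

Setting α to 0 or 1 recovers the purely word-frequency-based or purely semantic similarity, so the weighted form generalizes both.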
Optionally, in another embodiment of the present application, the text template recognition program, when executed by the processor, further implements the following step:
obtaining the weight value used for linear weighting.
Obtaining the weight value used for linear weighting includes:
assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
determining, by a preset clustering algorithm, whether the matched text and the preset text template belong to the same category, to obtain a clustering result;
determining from the clustering result whether the third similarity calculated according to the first initial value is accurate;
if the third similarity calculated according to the first initial value is determined to be accurate, taking the first initial value as the weight value used for linear weighting;
if the third similarity calculated according to the first initial value is determined to be inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
The above steps are used to obtain the value of α or β.
The clustering result is either that the matched text and the preset text template belong to the same category, or that they do not.
The first initial value may be 0.1, and each adjustment may increase it by 0.1. For example, if the weight being obtained is α, α is initially set to 0.1, so β is 0.9. The third similarity between the matched text and the preset text template is calculated with the preset linear weighting formula, and the clustering algorithm determines whether the matched text and the preset text template belong to the same category. If the two results disagree, for example the third similarity is below 50% while the clustering algorithm places the matched text and the preset text template in the same category, the third similarity calculated from the first initial value is determined to be inaccurate. Then let α = α + 0.1, so that α is 0.2 and β is 0.8, and repeat the calculation and the clustering check; if the result is still inaccurate, let α = α + 0.1 again, so that α is 0.3 and β is 0.7, and so on, until the optimal values of α and β are found.
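The weight search can be sketched as a small loop. This is one reading of the procedure, under stated assumptions: the clustering result is supplied as a boolean per text pair rather than computed, a third similarity is treated as "accurate" when its comparison with a 50% decision threshold agrees with the clustering result, and the search stops at the first α for which every pair agrees. The function name `find_alpha` and the sample data are hypothetical.

```python
def find_alpha(pairs, decision_threshold=0.5):
    """Search alpha in 0.1, 0.2, ..., 1.0 and return the first accurate value.

    pairs is a list of (sim_tfidf, sim_lda, same_category) tuples, where
    same_category is the clustering result for that text pair.
    Returns None if no alpha on the grid agrees with every clustering result.
    """
    for step in range(1, 11):
        alpha = step / 10          # integer steps avoid accumulating float error
        beta = 1.0 - alpha
        accurate = all(
            ((alpha * sim_lda + beta * sim_tfidf) > decision_threshold) == same_category
            for sim_tfidf, sim_lda, same_category in pairs
        )
        if accurate:
            return alpha
    return None


# One pair: low word-frequency similarity, high topic similarity, same cluster.
print(find_alpha([(0.2, 0.9, True)]))
```

For this pair the third similarity first exceeds 50% at α = 0.5 (0.5 × 0.9 + 0.5 × 0.2 = 0.55), so the search returns 0.5. A finer grid or a score that counts agreements, instead of requiring all pairs to agree, would be natural variations.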
In this embodiment, when the matched text is determined to be a text template similar to the preset text template, the matched text can be added to the template set of that preset text template. In this way, multiple text template sets can be obtained, each containing text templates that are similar to one another.
The text template recognition device proposed in this embodiment obtains a preset text template and a matched text; calculates a first similarity between the matched text and the preset text template according to a word-frequency-based text similarity algorithm; and/or calculates a second similarity between the matched text and the preset text template according to a semantics-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfies a preset similarity condition, determines that the matched text is a text template similar to the preset text template. Text templates similar to the preset text template can thus be obtained quickly without staff judging each one manually, which improves the efficiency of text template recognition. In addition, calculating text similarity with the word-frequency-based and/or the semantics-based text similarity algorithm can improve the accuracy of text template recognition.
Optionally, in other embodiments, the text template recognition program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present application. A module, as used in this application, is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the text template recognition program in the text template recognition device.
For example, FIG. 3 is a schematic diagram of the program modules of the text template recognition program 01 in an embodiment of the text template recognition device of the present application. In this embodiment, the text template recognition program may be divided into an acquisition module 10, a calculation module 20 and a determination module 30. Illustratively:
The acquisition module 10 is configured to obtain a preset text template and a matched text;
The calculation module 20 is configured to calculate a first similarity between the matched text and the preset text template according to a word-frequency-based text similarity algorithm, and/or calculate a second similarity between the matched text and the preset text template according to a semantics-based text similarity algorithm;
The determination module 30 is configured to determine, when the first similarity and/or the second similarity satisfies a preset similarity condition, that the matched text is a text template similar to the preset text template.
The functions or operation steps implemented when the acquisition module 10, the calculation module 20, the determination module 30 and other program modules are executed are substantially the same as those of the foregoing embodiments and are not repeated here.
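The three-module split can be pictured as three small functions chained together. This is only a structural sketch under assumptions: the function names are hypothetical, the similarity algorithm is injected as a callable instead of being implemented, and a single threshold stands in for the preset similarity condition.

```python
from typing import Callable, Tuple


def acquisition_module(template: str, candidate: str) -> Tuple[str, str]:
    """Obtain the preset text template and the matched text (here simply passed in)."""
    return template, candidate


def calculation_module(template: str, candidate: str,
                       similarity: Callable[[str, str], float]) -> float:
    """Calculate a similarity between the matched text and the preset template."""
    return similarity(candidate, template)


def determination_module(score: float, preset_similarity: float) -> bool:
    """Decide whether the matched text counts as a similar text template."""
    return score > preset_similarity


def recognize(template: str, candidate: str,
              similarity: Callable[[str, str], float],
              preset_similarity: float) -> bool:
    """Run acquisition, calculation and determination in sequence."""
    t, c = acquisition_module(template, candidate)
    return determination_module(calculation_module(t, c, similarity), preset_similarity)


def exact_match(a: str, b: str) -> float:
    """Toy similarity used only to exercise the pipeline: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


print(recognize("Your order has shipped", "Your order has shipped", exact_match, 0.9))
```

Passing the similarity in as a callable keeps the determination logic independent of whether the word-frequency-based score, the semantic score, or their weighted combination is used.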
In addition, an embodiment of the present application further provides a computer-readable storage medium on which a text template recognition program is stored, the text template recognition program being executable by one or more processors to implement the following operations:
obtaining a preset text template and a matched text;
calculating a first similarity between the matched text and the preset text template according to a word-frequency-based text similarity algorithm; and/or
calculating a second similarity between the matched text and the preset text template according to a semantics-based text similarity algorithm;
when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matched text is a text template similar to the preset text template.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as the embodiments of the text template recognition device and method described above and is not repeated here.
It should be noted that the serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. The terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, device, article or method. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, device, article or method that includes it.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims (20)

  1. A text template recognition method, characterized in that the method comprises:
    obtaining a preset text template and a matched text;
    calculating a first similarity between the matched text and the preset text template according to a word-frequency-based text similarity algorithm; and/or
    calculating a second similarity between the matched text and the preset text template according to a semantics-based text similarity algorithm;
    when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matched text is a text template similar to the preset text template.
  2. The text template recognition method according to claim 1, characterized in that calculating the first similarity between the matched text and the preset text template according to the word-frequency-based text similarity algorithm and/or calculating the second similarity between the matched text and the preset text template according to the semantics-based text similarity algorithm comprises:
    calculating the first similarity between the matched text and the preset text template using a vector space model;
    calculating the second similarity between the matched text and the preset text template using an LDA document topic generation model;
    and the first similarity and the second similarity satisfying the preset similarity condition comprises:
    linearly weighting the first similarity and the second similarity to obtain a third similarity between the matched text and the preset text template;
    determining whether the third similarity is greater than a third preset similarity;
    if the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity satisfy the preset similarity condition.
  3. The text template recognition method according to claim 2, characterized in that linearly weighting the first similarity and the second similarity to obtain the third similarity between the matched text and the preset text template comprises:
    inputting the first similarity and the second similarity into a preset linear weighting formula and outputting the third similarity between the matched text and the preset text template, the preset linear weighting formula being:
    $\mathrm{sim}(p,q) = \alpha\,\mathrm{sim}_{\mathrm{LDA}}(p,q) + \beta\,\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$,
    where $p$ and $q$ are the matched text and the preset text template respectively, $\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$ is the first similarity, $\mathrm{sim}_{\mathrm{LDA}}(p,q)$ is the second similarity, $\mathrm{sim}(p,q)$ is the third similarity, and $\alpha$ and $\beta$ are preset weight values.
  4. The text template recognition method according to claim 2, characterized in that the method further comprises:
    obtaining a weight value used for linear weighting, including:
    assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
    determining, by a preset clustering algorithm, whether the matched text and the preset text template belong to the same category, to obtain a clustering result;
    determining from the clustering result whether the third similarity calculated according to the first initial value is accurate;
    if the third similarity calculated according to the first initial value is determined to be accurate, taking the first initial value as the weight value used for linear weighting;
    if the third similarity calculated according to the first initial value is determined to be inaccurate, adjusting the first initial value and performing the operation of calculating the third similarity according to the first initial value.
  5. The text template recognition method according to claim 3, characterized in that the method further comprises:
    obtaining a weight value used for linear weighting, including:
    assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
    determining, by a preset clustering algorithm, whether the matched text and the preset text template belong to the same category, to obtain a clustering result;
    determining from the clustering result whether the third similarity calculated according to the first initial value is accurate;
    if the third similarity calculated according to the first initial value is determined to be accurate, taking the first initial value as the weight value used for linear weighting;
    if the third similarity calculated according to the first initial value is determined to be inaccurate, adjusting the first initial value and performing the operation of calculating the third similarity according to the first initial value.
  6. The text template recognition method according to claim 1, characterized in that the first similarity or the second similarity satisfying the preset similarity condition comprises:
    the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
  7. The text template recognition method according to any one of claims 2-5, characterized in that the first similarity or the second similarity satisfying the preset similarity condition comprises:
    the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
  8. A text template recognition device, characterized in that the device comprises a memory and a processor, the memory storing a text template recognition program executable on the processor, the text template recognition program, when executed by the processor, implementing the following steps:
    obtaining a preset text template and a matched text;
    calculating a first similarity between the matched text and the preset text template according to a word-frequency-based text similarity algorithm; and/or
    calculating a second similarity between the matched text and the preset text template according to a semantics-based text similarity algorithm;
    when the first similarity and/or the second similarity satisfies a preset similarity condition, determining that the matched text is a text template similar to the preset text template.
  9. The text template recognition device according to claim 8, characterized in that calculating the first similarity between the matched text and the preset text template according to the word-frequency-based text similarity algorithm and/or calculating the second similarity between the matched text and the preset text template according to the semantics-based text similarity algorithm comprises:
    calculating the first similarity between the matched text and the preset text template using a vector space model;
    calculating the second similarity between the matched text and the preset text template using an LDA document topic generation model;
    and the first similarity and the second similarity satisfying the preset similarity condition comprises:
    linearly weighting the first similarity and the second similarity to obtain a third similarity between the matched text and the preset text template;
    determining whether the third similarity is greater than a third preset similarity;
    if the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity satisfy the preset similarity condition.
  10. The text template recognition device according to claim 9, characterized in that linearly weighting the first similarity and the second similarity to obtain the third similarity between the matched text and the preset text template comprises:
    inputting the first similarity and the second similarity into a preset linear weighting formula and outputting the third similarity between the matched text and the preset text template, the preset linear weighting formula being:
    $\mathrm{sim}(p,q) = \alpha\,\mathrm{sim}_{\mathrm{LDA}}(p,q) + \beta\,\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$,
    where $p$ and $q$ are the matched text and the preset text template respectively, $\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$ is the first similarity, $\mathrm{sim}_{\mathrm{LDA}}(p,q)$ is the second similarity, $\mathrm{sim}(p,q)$ is the third similarity, and $\alpha$ and $\beta$ are preset weight values.
  11. The text template recognition device according to claim 9, characterized in that the text template recognition program, when executed by the processor, further implements the following steps:
    obtaining a weight value used for linear weighting, including:
    assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
    determining, by a preset clustering algorithm, whether the matched text and the preset text template belong to the same category, to obtain a clustering result;
    determining from the clustering result whether the third similarity calculated according to the first initial value is accurate;
    if the third similarity calculated according to the first initial value is determined to be accurate, taking the first initial value as the weight value used for linear weighting;
    if the third similarity calculated according to the first initial value is determined to be inaccurate, adjusting the first initial value and performing the operation of calculating the third similarity according to the first initial value.
  12. The text template recognition device according to claim 10, characterized in that the text template recognition program, when executed by the processor, further implements the following steps:
    obtaining a weight value used for linear weighting, including:
    assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
    determining, by a preset clustering algorithm, whether the matched text and the preset text template belong to the same category, to obtain a clustering result;
    determining from the clustering result whether the third similarity calculated according to the first initial value is accurate;
    if the third similarity calculated according to the first initial value is determined to be accurate, taking the first initial value as the weight value used for linear weighting;
    if the third similarity calculated according to the first initial value is determined to be inaccurate, adjusting the first initial value and performing the operation of calculating the third similarity according to the first initial value.
  13. The text template recognition device according to claim 8, characterized in that the first similarity or the second similarity satisfying the preset similarity condition comprises:
    the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
  14. The text template recognition device according to any one of claims 9-12, characterized in that the first similarity or the second similarity satisfying the preset similarity condition comprises:
    the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
  15. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质上存储有文本模板识别程序,所述文本模板识别程序可被一个或者多个处理器执行,以实现如下步骤:A computer-readable storage medium, characterized in that a text template recognition program is stored on the computer-readable storage medium, and the text template recognition program can be executed by one or more processors to implement the following steps:
    获取预设文本模板和匹配文本;Obtain the preset text template and matching text;
    根据基于词频的文本相似度算法计算所述匹配文本与所述预设文本模板的第一相似度;和/或Calculate the first similarity between the matched text and the preset text template according to a text similarity algorithm based on word frequency; and/or
    根据基于语义的文本相似度算法计算所述匹配文本与所述预设文本模板的第二相似度;Calculating the second similarity between the matched text and the preset text template according to a semantic-based text similarity algorithm;
    当所述第一相似度和/或所述第二相似度满足预设相似度条件时,确定所述匹配文本为与所述预设文本模板相似的文本模板。When the first degree of similarity and/or the second degree of similarity satisfy a preset similarity condition, it is determined that the matching text is a text template similar to the preset text template.
  16. The computer-readable storage medium according to claim 15, wherein calculating the first similarity between the matching text and the preset text template according to the word-frequency-based text similarity algorithm and/or calculating the second similarity between the matching text and the preset text template according to the semantics-based text similarity algorithm comprises:
    calculating the first similarity between the matching text and the preset text template by using a vector space model; and
    calculating the second similarity between the matching text and the preset text template by using an LDA document topic generation model;
    and wherein the first similarity and the second similarity satisfying the preset similarity condition comprises:
    linearly weighting the first similarity and the second similarity to obtain a third similarity between the matching text and the preset text template;
    determining whether the third similarity is greater than a third preset similarity; and
    if the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity satisfy the preset similarity condition.
  17. The computer-readable storage medium according to claim 16, wherein linearly weighting the first similarity and the second similarity to obtain the third similarity between the matching text and the preset text template comprises:
    inputting the first similarity and the second similarity into a preset linear weighting formula, and outputting the third similarity between the matching text and the preset text template, the preset linear weighting formula being:
    $\mathrm{sim}(p,q) = \alpha\,\mathrm{sim}_{\mathrm{LDA}}(p,q) + \beta\,\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$,
    where p and q are the matching text and the preset text template respectively, $\mathrm{sim}_{\mathrm{TFIDF}}(p,q)$ is the first similarity, $\mathrm{sim}_{\mathrm{LDA}}(p,q)$ is the second similarity, $\mathrm{sim}(p,q)$ is the third similarity, and α and β are preset weight values.
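The linear weighting formula of claim 17 is small enough to write out directly. The sketch below is illustrative only: the function names and the default weights of 0.5 are assumptions, and the claim itself does not require that α and β sum to 1.

```python
def weighted_similarity(sim_lda: float, sim_tfidf: float,
                        alpha: float = 0.5, beta: float = 0.5) -> float:
    """Third similarity: sim(p, q) = alpha * sim_LDA(p, q) + beta * sim_TFIDF(p, q)."""
    # The default weights are placeholders; the claims call them preset values.
    return alpha * sim_lda + beta * sim_tfidf


def satisfies_weighted_condition(sim_lda: float, sim_tfidf: float,
                                 third_preset_similarity: float,
                                 alpha: float = 0.5, beta: float = 0.5) -> bool:
    """True when the third similarity exceeds the third preset similarity."""
    third = weighted_similarity(sim_lda, sim_tfidf, alpha, beta)
    return third > third_preset_similarity
```

With equal weights, a semantic similarity of 0.5 and a word-frequency similarity of 1.0 combine to 0.75, which would pass a third preset similarity of 0.7 but not one of 0.8.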
  18. The computer-readable storage medium according to claim 16, wherein the text template recognition program, when executed by the processor, further implements the following steps:
    obtaining weight values used for the linear weighting, comprising:
    assigning a first initial value to the weight values, and calculating the third similarity according to the first initial value;
    determining, by a preset clustering algorithm, whether the matching template and the preset text template belong to the same category, and obtaining a clustering result;
    determining, based on the clustering result, whether the third similarity calculated according to the first initial value is accurate;
    if it is determined that the third similarity calculated according to the first initial value is accurate, determining the first initial value as the weight values used for the linear weighting; and
    if it is determined that the third similarity calculated according to the first initial value is inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
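The weight-selection loop of claim 18 can be sketched as follows, under several stated assumptions that the claims leave open: the clustering result is supplied as a precomputed boolean per text pair, "accurate" is taken to mean that the threshold decision agrees with the clustering result for every pair, β is fixed to 1 − α, and the adjustment rule is a fixed step that wraps around to 0. The names `decisions_agree` and `tune_alpha` are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# (sim_lda, sim_tfidf, same_cluster): same_cluster is the output of some
# clustering algorithm, which this sketch treats as a given input.
Pair = Tuple[float, float, bool]


def decisions_agree(pairs: Iterable[Pair], alpha: float, threshold: float) -> bool:
    """Check whether the weighted-similarity decision matches every clustering result."""
    beta = 1.0 - alpha  # assumption: the two weights sum to 1
    return all(
        ((alpha * sim_lda + beta * sim_tfidf) > threshold) == same_cluster
        for sim_lda, sim_tfidf, same_cluster in pairs
    )


def tune_alpha(pairs: Iterable[Pair], threshold: float,
               initial_alpha: float = 0.5, step: float = 0.1,
               max_rounds: int = 20) -> Optional[float]:
    """Start from an initial weight and adjust it until the decisions are accurate."""
    pairs = list(pairs)
    alpha = initial_alpha
    for _ in range(max_rounds):
        if decisions_agree(pairs, alpha, threshold):
            return alpha  # the current value becomes the weight to use
        # Hypothetical adjustment rule: step upward, wrapping back to 0.
        alpha = round(alpha + step, 10)
        if alpha > 1.0:
            alpha = 0.0
    return None  # no tried weight reproduced the clustering result
```

Returning `None` when no candidate agrees with the clustering result is a design choice of this sketch; the claims only describe adjusting the value and recomputing.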
  19. The computer-readable storage medium according to claim 17, wherein the text template recognition program, when executed by the processor, further implements the following steps:
    obtaining weight values used for the linear weighting, comprising:
    assigning a first initial value to the weight values, and calculating the third similarity according to the first initial value;
    determining, by a preset clustering algorithm, whether the matching template and the preset text template belong to the same category, and obtaining a clustering result;
    determining, based on the clustering result, whether the third similarity calculated according to the first initial value is accurate;
    if it is determined that the third similarity calculated according to the first initial value is accurate, determining the first initial value as the weight values used for the linear weighting; and
    if it is determined that the third similarity calculated according to the first initial value is inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
  20. The computer-readable storage medium according to any one of claims 16-18, wherein the first similarity or the second similarity satisfying the preset similarity condition comprises:
    the first similarity being greater than a first preset similarity, or the second similarity being greater than a second preset similarity.
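The "or" form of the preset similarity condition in claims 13, 14 and 20 reduces to two independent threshold comparisons. A minimal sketch, with a hypothetical function name and the thresholds passed in by the caller:

```python
def meets_either_threshold(first_similarity: float, second_similarity: float,
                           first_preset_similarity: float,
                           second_preset_similarity: float) -> bool:
    """True when either similarity strictly exceeds its own preset threshold."""
    # Strict comparison mirrors the claim wording "greater than".
    return (first_similarity > first_preset_similarity
            or second_similarity > second_preset_similarity)
```

Under this reading, a text whose word-frequency similarity is low can still be accepted when its semantic similarity alone clears the second threshold.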
PCT/CN2019/088628 2019-02-11 2019-05-27 Text template recognition method and apparatus, and computer readable storage medium WO2020164204A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910109887.0 2019-02-11
CN201910109887.0A CN109977995A (en) 2019-02-11 2019-02-11 Text template recognition methods, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2020164204A1 (en) 2020-08-20

Family

ID=67076907

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088628 WO2020164204A1 (en) 2019-02-11 2019-05-27 Text template recognition method and apparatus, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109977995A (en)
WO (1) WO2020164204A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033640B (en) * 2021-03-16 2023-08-15 深圳棱镜空间智能科技有限公司 Template matching method, device, equipment and computer readable storage medium
CN113780449B (en) * 2021-09-16 2023-08-25 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN107992477A (en) * 2017-11-30 2018-05-04 北京神州泰岳软件股份有限公司 Text subject determines method, apparatus and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3690216B2 (en) * 1999-11-26 2005-08-31 日本電気株式会社 Document similarity calculation method, system and apparatus, and recording medium recording similarity calculation program
CN103377239B (en) * 2012-04-26 2020-08-07 深圳市世纪光速信息技术有限公司 Method and device for calculating similarity between texts
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
CN109165291B (en) * 2018-06-29 2021-07-09 厦门快商通信息技术有限公司 Text matching method and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724738A (en) * 2021-08-31 2021-11-30 平安普惠企业管理有限公司 Voice processing method, decision tree model training method, device, equipment and storage medium
CN113724738B (en) * 2021-08-31 2024-04-23 硅基(昆山)智能科技有限公司 Speech processing method, decision tree model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109977995A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
US11449767B2 (en) Method of building a sorting model, and application method and apparatus based on the model
US9767144B2 (en) Search system with query refinement
WO2019153607A1 (en) Intelligent response method, electronic device and storage medium
CN109815333B (en) Information acquisition method and device, computer equipment and storage medium
WO2021196476A1 (en) Object recommendation method, electronic device, and storage medium
WO2019091026A1 (en) Knowledge base document rapid search method, application server, and computer readable storage medium
US20210097238A1 (en) User keyword extraction device and method, and computer-readable storage medium
CN107704512B (en) Financial product recommendation method based on social data, electronic device and medium
WO2020237856A1 (en) Smart question and answer method and apparatus based on knowledge graph, and computer storage medium
US9483460B2 (en) Automated formation of specialized dictionaries
WO2021218322A1 (en) Paragraph search method and apparatus, and electronic device and storage medium
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
US20220318275A1 (en) Search method, electronic device and storage medium
CN113449187B (en) Product recommendation method, device, equipment and storage medium based on double images
CN110362601B (en) Metadata standard mapping method, device, equipment and storage medium
WO2020253042A1 (en) Intelligent sentiment judgment method and device, and computer readable storage medium
WO2020164204A1 (en) Text template recognition method and apparatus, and computer readable storage medium
CN107992477A (en) Text subject determines method, apparatus and electronic equipment
WO2023029513A1 (en) Artificial intelligence-based search intention recognition method and apparatus, device, and medium
CN110427453B (en) Data similarity calculation method, device, computer equipment and storage medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
WO2018171295A1 (en) Method and apparatus for tagging article, terminal, and computer readable storage medium
US20130218876A1 (en) Method and apparatus for enhancing context intelligence in random index based system
CN111553556A (en) Business data analysis method and device, computer equipment and storage medium
CN113988157A (en) Semantic retrieval network training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19915439

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19915439

Country of ref document: EP

Kind code of ref document: A1