WO2022183923A1 - Phrase generation method and apparatus, and computer readable storage medium - Google Patents

Info

Publication number
WO2022183923A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
participle
speech
phrases
candidate
Prior art date
Application number
PCT/CN2022/077155
Other languages
French (fr)
Chinese (zh)
Inventor
朱鹏军
巨荣辉
崔明
葛一迪
刘朋樟
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Application filed by 北京沃东天骏信息技术有限公司, 北京京东世纪贸易有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The present disclosure relates to the technical field of computers, and relates to a phrase generation method and apparatus, and a computer readable storage medium. The method of the present disclosure comprises: for each obtained initial phrase, determining the part-of-speech and order of each word in the initial phrase to obtain a part-of-speech combination of the initial phrase, wherein the part-of-speech combination is the part-of-speech of each word arranged according to the order of each word; selecting one or more part-of-speech combinations according to the number of times of occurrence of each part-of-speech combination; selecting, from the words of an alternative text, a word that conforms to the part-of-speech in the selected part-of-speech combination, and according to the selected part-of-speech combination, generating a phrase as an alternative phrase; and according to the closeness degree of each word in each alternative phrase, selecting the alternative phrase as a generated phrase.

Description

Phrase generation method, apparatus and computer readable storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to CN application No. 202110234468.7, filed on March 3, 2021, the disclosure of which is hereby incorporated into this application in its entirety.
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, and in particular to a phrase generation method, a phrase generation apparatus, and a computer-readable storage medium.
BACKGROUND
Objects on Internet platforms are often described with short phrases, such as "whitening and moisturizing" or "outdoor barbecue". These phrases can be displayed as tags of an object, can provide an index on the search side, and can provide writing material for generation tasks such as text generation. For example, a search index for SKUs can be built from "phrase + product word" combinations, so that when users are guided to search for related keywords, the relevant products can be located quickly.
Such a phrase is a fixed fragment in which two or more words form a certain combination and are frequently used together in different sentences. The method known to the inventors for generating phrases on Internet platforms is to manually define rules for combining words and to combine words into phrases according to those rules.
SUMMARY
According to some embodiments of the present disclosure, a phrase generation method is provided, including: for each acquired initial phrase, determining the part of speech and the order of each segmented word in the initial phrase to obtain a part-of-speech combination of the initial phrase, wherein the part-of-speech combination is the parts of speech of the segmented words arranged in the order of the segmented words; selecting one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination; selecting, from the segmented words of a candidate text, segmented words that match the parts of speech in the selected part-of-speech combination, and generating phrases according to the selected part-of-speech combination as candidate phrases; and selecting candidate phrases as generated phrases according to the closeness of the segmented words in each candidate phrase.
In some embodiments, selecting candidate phrases as generated phrases according to the closeness of the segmented words in each candidate phrase includes: for each candidate phrase, determining the closeness of the segmented words in the candidate phrase according to the number of times each segmented word individually appears in a preset text and the number of times the segmented words appear consecutively in the preset text; and selecting candidate phrases whose closeness is not lower than a closeness threshold as the generated phrases.
In some embodiments, for each candidate phrase, the closeness of the segmented words in the candidate phrase is the ratio of the probability that the segmented words appear consecutively in the preset text to the product of the probabilities that each segmented word individually appears in the preset text.
In some embodiments, selecting one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination includes: for each part-of-speech combination, determining a weight of the part-of-speech combination according to its number of occurrences and the maximum and minimum among the numbers of occurrences of all part-of-speech combinations; and selecting one or more part-of-speech combinations whose weight is not lower than a weight threshold.
In some embodiments, the method further includes: when the generated phrases include multiple phrases that have the same segmented words in different orders, determining the probability of occurrence of the segmented-word sequence of each of the multiple phrases; determining the fluency of each phrase according to the probability of occurrence of its segmented-word sequence; and selecting one or more phrases according to the fluency of each phrase and updating the generated phrases to the selected phrases.
In some embodiments, determining the probability of occurrence of the segmented-word sequence of each of the multiple phrases includes: inputting the segmented-word sequence of each phrase into a pre-trained natural language processing model to obtain the probability of occurrence of the segmented-word sequence of each phrase.
In some embodiments, determining the fluency of each phrase according to the probability of occurrence of its segmented-word sequence includes: for each phrase, taking the reciprocal of the probability of occurrence of the phrase's segmented-word sequence and extracting its root of order equal to the number of segmented words, to obtain the fluency of the phrase.
In some embodiments, the method further includes: selecting a plurality of phrases in a training corpus as the initial phrases according to the similarity between each phrase in the training corpus and a seed phrase.
In some embodiments, selecting a plurality of phrases in the training corpus as the initial phrases according to the similarity between each phrase in the training corpus and the seed phrase includes: determining a vector for each phrase in the training corpus and for the seed phrase; determining the similarity between each phrase in the training corpus and the seed phrase according to the similarity between the vector of that phrase and the vector of the seed phrase; and selecting, as the initial phrases, a plurality of phrases whose similarity to the seed phrase is not lower than a similarity threshold.
In some embodiments, the method further includes: selecting a plurality of segmented words in the training corpus as initial segmented words according to the similarity between each segmented word in the training corpus and a first seed word; and combining each initial segmented word with a second seed word to obtain a plurality of seed phrases.
According to other embodiments of the present disclosure, a phrase generation apparatus is provided, including: a part-of-speech combination determination module, configured to determine, for each acquired initial phrase, the part of speech and the order of each segmented word in the initial phrase to obtain a part-of-speech combination of the initial phrase, wherein the part-of-speech combination is the parts of speech of the segmented words arranged in the order of the segmented words; a part-of-speech combination selection module, configured to select one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination; a candidate phrase generation module, configured to select, from the segmented words of a candidate text, segmented words that match the parts of speech in the selected part-of-speech combination, and to generate phrases according to the selected part-of-speech combination as candidate phrases; and a phrase generation module, configured to select candidate phrases as generated phrases according to the closeness of the segmented words in each candidate phrase.
According to still other embodiments of the present disclosure, a phrase generation apparatus is provided, including: a processor; and a memory coupled to the processor and configured to store instructions which, when executed by the processor, cause the processor to perform the phrase generation method of any of the foregoing embodiments.
According to yet other embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the phrase generation method of any of the foregoing embodiments.
Other features and advantages of the present disclosure will become clear from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings described herein provide a further understanding of the present disclosure and constitute a part of this application. The exemplary embodiments of the present disclosure and their descriptions serve to explain the present disclosure and do not unduly limit it.
FIG. 1 shows a schematic flowchart of a phrase generation method according to some embodiments of the present disclosure.
FIG. 2 shows a schematic flowchart of a phrase generation method according to other embodiments of the present disclosure.
FIG. 3 shows a schematic flowchart of a phrase generation method according to still other embodiments of the present disclosure.
FIG. 4 shows a schematic structural diagram of a phrase generation apparatus according to some embodiments of the present disclosure.
FIG. 5 shows a schematic structural diagram of a phrase generation apparatus according to other embodiments of the present disclosure.
FIG. 6 shows a schematic structural diagram of a phrase generation apparatus according to still other embodiments of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
The inventors found that manually defined rules are not necessarily general and may generate a large number of low-quality phrases, for example phrases whose words are entirely unrelated and whose meaning is unclear.
A technical problem to be solved by the present disclosure is how to improve the quality and the effective rate of phrase generation.
The present disclosure provides a phrase generation method, which is described below with reference to FIGS. 1 to 3.
FIG. 1 is a flowchart of some embodiments of the phrase generation method of the present disclosure. As shown in FIG. 1, the method of this embodiment includes steps S102 to S108.
In step S102, for each acquired initial phrase, the part of speech and the order of each segmented word in the initial phrase are determined to obtain a part-of-speech combination of the initial phrase.
First, a plurality of initial phrases are acquired. These initial phrases may be generated from a small number of seed words; the generation of initial phrases is described in later embodiments. Alternatively, the initial phrases may be preset. An initial phrase may be a phrase that represents a preset dimension, such as a time dimension, a food dimension, or a cosmetics dimension. For example, phrases in the time dimension may include words that describe time, such as Spring Festival or Mid-Autumn Festival. Initial phrases of different dimensions can be selected according to actual needs, so that the subsequently generated phrases belong to the same dimension as the initial phrases.
In some embodiments, word segmentation and part-of-speech tagging are performed on each initial phrase to obtain the part of speech of each segmented word in the initial phrase, and the parts of speech are combined in the order of the segmented words to obtain the part-of-speech combination of the initial phrase. That is, the part-of-speech combination is the parts of speech of the segmented words arranged in the order of the segmented words. Existing natural language processing (NLP) algorithms can be used for the word segmentation and part-of-speech tagging of each initial phrase. Part-of-speech tagging determines, for example, whether a segmented word is a noun, a verb, and so on. Table 1 shows examples of the part-of-speech combinations of some phrases.
Table 1
[Table 1: image PCTCN2022077155-appb-000001, example phrases and their part-of-speech combinations]
In Table 1, t denotes a time word, v a verb, n a noun, r a pronoun, and a an adjective. For example, the part-of-speech combination of "Mid-Autumn Festival gift-giving" is t-v, that is, a time word followed by a verb.
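As an illustration of step S102, the following minimal sketch derives and counts part-of-speech combinations. It assumes the initial phrases have already been segmented and tagged; the function names and all sample phrases other than the Mid-Autumn Festival example are hypothetical.

```python
from collections import Counter

# Hypothetical pre-tagged initial phrases: each phrase is a list of (word, tag) pairs.
# In practice the pairs would come from a word segmenter and part-of-speech tagger.
initial_phrases = [
    [("Mid-Autumn Festival", "t"), ("gift-giving", "v")],
    [("Spring Festival", "t"), ("gift box", "n")],
    [("New Year", "t"), ("shopping", "v")],
]

def pos_combination(tagged_phrase):
    """Join the tags in word order, e.g. a time word then a verb gives "t-v"."""
    return "-".join(tag for _word, tag in tagged_phrase)

def count_combinations(tagged_phrases):
    """Count how often each part-of-speech combination occurs."""
    return Counter(pos_combination(phrase) for phrase in tagged_phrases)

combination_counts = count_combinations(initial_phrases)
print(combination_counts)
```

The resulting counts are the input that step S104 uses when weighting the combinations.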
In step S104, one or more part-of-speech combinations are selected according to the number of occurrences of each part-of-speech combination.
The more often a part-of-speech combination occurs, the more likely it is to be selected. In some embodiments, for each part-of-speech combination, a weight is determined according to the number of occurrences of that combination and the maximum and minimum among the numbers of occurrences of all part-of-speech combinations; one or more part-of-speech combinations whose weight is not lower than a weight threshold are then selected. For example, the following formula is used to determine the weight of each part-of-speech combination.
[Formula (1): image PCTCN2022077155-appb-000002]
In formula (1), $x_i$ denotes the number of occurrences (frequency) of part-of-speech combination $i$, where $i$ is a positive integer; $x_{\max}$ denotes the maximum among the numbers of occurrences of all part-of-speech combinations; and $x_{\min}$ denotes the minimum among them. Table 2 shows the numbers of occurrences and the weights of various part-of-speech combinations.
Table 2
No.   Part-of-speech combination   Count   Weight
1     t-v                          150     0.99
2     t-n                          138     0.93
3     t-a                          109     0.886
4     v-n-r-n-a                    26      0.45
5
For example, the weight threshold can be set to 0.35, and the part-of-speech combinations with a weight higher than 0.35 are selected. The weight threshold can be set according to actual needs. The higher the weight threshold, the more frequently the selected part-of-speech combinations occur; that is, the more general the combinations are, the more general the subsequently generated phrases are and the higher the effective rate is. The lower the weight threshold, the more part-of-speech combinations are selected and the more phrases are finally generated, covering a richer variety of phrase types.
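The threshold selection can be sketched as follows. The counts and weights are copied from Table 2 and taken as given, since the sketch does not compute formula (1); the function name and the second, stricter threshold of 0.9 are assumptions for illustration.

```python
# Counts and weights copied from Table 2. The weights are used as given;
# this sketch does not attempt to compute formula (1).
table_2 = {
    "t-v": {"count": 150, "weight": 0.99},
    "t-n": {"count": 138, "weight": 0.93},
    "t-a": {"count": 109, "weight": 0.886},
    "v-n-r-n-a": {"count": 26, "weight": 0.45},
}

def select_combinations(table, weight_threshold):
    """Keep the combinations whose weight is not lower than the threshold."""
    return [combo for combo, row in table.items() if row["weight"] >= weight_threshold]

# Threshold from the text: every listed combination passes.
selected = select_combinations(table_2, 0.35)

# A hypothetical stricter threshold keeps only the most frequent combinations.
strict = select_combinations(table_2, 0.9)
```

Raising the threshold from 0.35 to 0.9 drops "t-a" and "v-n-r-n-a", which matches the trade-off described above between generality and coverage.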
In step S106, segmented words that match the parts of speech in the selected part-of-speech combination are selected from the segmented words of the candidate text, and phrases are generated according to the selected part-of-speech combination as candidate phrases.
First, candidate texts are acquired. A candidate text may be a title, a search text, or a review article on a network platform. Word segmentation and part-of-speech tagging are performed on each candidate text to obtain its segmented words and their parts of speech. According to the selected part-of-speech combinations, the segmented words and parts of speech in the candidate text are traversed to form candidate phrases, yielding a set of candidate phrases. For example, from the product title "New Year gift box, New Year gift, Spring Festival gift box, New Year goods, high-end gift box set, corporate group purchase, custom logo, gift-giving, New Year goods wholesale, New Year goods gift pack, custom nut gift box, custom model D, double-layer fruit tray basket [health-preserving gift box] gift box", the following candidate phrases can be obtained: "New Year customized gift box (t-v-n)", "Spring Festival gift box (a-n)", "Spring Festival high-end gift box (t-a-n)", "New Year goods wholesale (n-v)", "health-preserving enterprise (n-v)", "health-preserving gift box (v-n)", "New Year gift pack (t-n)", "customized enterprise (v-n)", and so on.
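One possible reading of the traversal in step S106 is sketched below. It assumes that any distinct words of the candidate text whose tags fill the slots of a selected combination may be joined, whether or not they are adjacent in the text (the example phrases above suggest this, but it remains an assumption). The function name and the tagged title are hypothetical.

```python
from itertools import product

def generate_candidates(tagged_words, combinations):
    """Build candidate phrases whose tags follow a selected combination.

    tagged_words: list of (word, tag) pairs from one candidate text.
    combinations: iterable of combinations such as "t-n" or "t-a-n".
    """
    # Group the distinct words of the text by tag, keeping first-seen order.
    words_by_tag = {}
    for word, tag in tagged_words:
        bucket = words_by_tag.setdefault(tag, [])
        if word not in bucket:
            bucket.append(word)

    candidates = []
    for combo in combinations:
        # One list of possible words per slot of the combination.
        slots = [words_by_tag.get(tag, []) for tag in combo.split("-")]
        for words in product(*slots):
            # Skip phrases that would repeat the same word.
            if len(set(words)) == len(words):
                candidates.append((words, combo))
    return candidates

# Hypothetical tagging of a shortened product title.
title = [("New Year", "t"), ("gift box", "n"), ("Spring Festival", "t"),
         ("high-end", "a"), ("customized", "v")]
candidates = generate_candidates(title, ["t-n", "t-a-n"])
for words, combo in candidates:
    print(" ".join(words), f"({combo})")
```

Because every slot combination is enumerated, many candidates are loosely related word pairs, which is why the closeness filter of step S108 follows.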
In step S108, candidate phrases are selected as generated phrases according to the closeness of the segmented words in each candidate phrase.
The closeness of the segmented words in a candidate phrase indicates how strongly the segmented words are correlated or mutually dependent: the higher the correlation or mutual dependence, the higher the closeness. In some embodiments, for each candidate phrase, the closeness of its segmented words is determined according to the number of times each segmented word individually appears in a preset text and the number of times the segmented words appear consecutively in the preset text; candidate phrases whose closeness is not lower than a closeness threshold are selected as the generated phrases.
The preset text may be high-quality text obtained by filtering, for example: review text provided by users whose credit rating is above a credit threshold, whose number of friends is above a friend-count threshold, or whose number of followers is above a follower-count threshold; review text whose number of comments is above a comment-count threshold or whose number of views is above a view-count threshold; search texts or titles whose number of searches is above a search-count threshold; or titles whose number of views is above a view-count threshold. The preset text better reflects users' behavioral habits, so evaluating the closeness of the segmented words of a candidate phrase against the preset text is more accurate.
For each candidate phrase, the fewer times the segmented words individually appear in the preset text and the more times they appear consecutively in the preset text, the higher their closeness. In some embodiments, for each candidate phrase, the closeness of its segmented words is the ratio of the probability that the segmented words appear consecutively in the preset text to the product of the probabilities that each segmented word individually appears in the preset text. The probability or number of consecutive occurrences may be counted without regard to the order of the segmented words. For example, if "Spring Festival gift box" appears once and "gift box Spring Festival" appears once, then "Spring Festival" and "gift box" appear consecutively twice. The closeness of the segmented words in a candidate phrase can be expressed by the following formula.
$$\text{closeness}(u, v, \ldots) = \frac{P(u, v, \ldots)}{P(u)\,P(v)\cdots} \tag{2}$$
In formula (2), $P(u, v, \ldots)$ denotes the probability that the segmented words of the candidate phrase appear consecutively in the preset text, and $P(u)$, $P(v)$, ... denote the probabilities that each segmented word individually appears in the preset text.
For example, suppose the preset text contains 1 million words (counting repetitions), in which "Spring Festival" appears 80,000 times, "gift box" appears 65,000 times, and "Spring Festival" and "gift box" appear together 50,000 times. The corresponding probabilities are P(Spring Festival) = 0.08, P(gift box) = 0.065, and P(Spring Festival, gift box) = 0.05. Based on formula (2), the closeness of "Spring Festival" and "gift box" is then P(Spring Festival, gift box) / (P(Spring Festival) × P(gift box)) = 0.05 / (0.08 × 0.065) ≈ 9.62.
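The closeness computation can be sketched as follows, using the counts from the example above (80,000, 65,000 and 50,000 occurrences among 1,000,000 words) and treating closeness as the plain ratio described in the text. The function names are hypothetical, and the second helper shows one assumed way of counting consecutive occurrences without regard to order.

```python
from collections import Counter
from math import prod

def closeness(joint_count, word_counts, total_words):
    """Ratio of the joint probability to the product of the individual probabilities."""
    joint_p = joint_count / total_words
    individual_p = prod(count / total_words for count in word_counts)
    return joint_p / individual_p

# Counts from the example: 1,000,000 words in total.
score = closeness(joint_count=50_000, word_counts=[80_000, 65_000], total_words=1_000_000)
print(round(score, 2))  # 9.62

def adjacent_pair_counts(token_lists):
    """Count adjacent word pairs while ignoring their order."""
    pairs = Counter()
    for tokens in token_lists:
        for left, right in zip(tokens, tokens[1:]):
            pairs[frozenset((left, right))] += 1
    return pairs

# "Spring Festival gift box" once and "gift box Spring Festival" once: 2 in total.
pairs = adjacent_pair_counts([["Spring Festival", "gift box"],
                              ["gift box", "Spring Festival"]])
pair_count = pairs[frozenset(("Spring Festival", "gift box"))]
```

A value well above 1 indicates that the words co-occur far more often than independent words would, so such a phrase would pass a closeness threshold set near 1.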
In the method of the above embodiment, a plurality of initial phrases are first acquired and the part-of-speech combination of each initial phrase is determined; one or more part-of-speech combinations are then selected according to the number of occurrences of each combination. Candidate phrases are generated from the candidate text according to the selected part-of-speech combinations, and candidate phrases are selected as generated phrases based on the closeness of their segmented words. Starting from a small number of initial phrases, the method can extract a large number of candidate phrases from the candidate text, and the part-of-speech combinations of these candidate phrases are more general and more logical. Filtering by the closeness of the segmented words further ensures that the segmented words in the final phrases are closely related, avoids phrases composed of entirely unrelated words, and improves the quality and the effective rate of phrase generation.
In some embodiments, at least one of a title of an object and a search index is generated according to the generated phrases.
The phrases generated by the method of the above embodiment are more reasonable, with improved quality and effective rate. To improve them further, the present disclosure additionally screens the generated phrases. Other embodiments of the phrase generation method of the present disclosure are described below with reference to FIG. 2.
FIG. 2 is a flowchart of other embodiments of the phrase generation method of the present disclosure. As shown in FIG. 2, after step S108 the method of this embodiment further includes steps S202 to S206.
In step S202, when the generated phrases include multiple phrases that have the same segmented words in different orders, the probability of occurrence of the segmented-word sequence of each of the multiple phrases is determined.
For example, "Spring Festival gift box" and "gift box Spring Festival" are two phrases with the same segmented words in different orders. Under the method of the foregoing embodiment, both have the same closeness, but "Spring Festival gift box" is the more fluent, higher-quality phrase. The subsequent steps are therefore performed to select the higher-quality phrase. For example, the segmented-word sequence of each phrase is input into a pre-trained natural language processing model to obtain the probability of occurrence of that sequence. The natural language processing model is, for example, an N-Gram model; an existing model can be used and is not described further here. The model can be trained in advance on the corpus of the Internet platform.
In step S204, the fluency of each phrase is determined according to the probability of occurrence of its segmented-word sequence.
The higher the probability of occurrence of a phrase's segmented-word sequence, the higher the fluency of the phrase. For example, for each phrase, the reciprocal of the probability of occurrence of the phrase's segmented-word sequence is taken, and its root of order equal to the number of segmented words is extracted, to obtain the fluency of the phrase. The following formula can be used to determine the fluency of a phrase.
$$\mathrm{fluency}(w_1 w_2 \ldots w_N) = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \qquad (3)$$
In formula (3), $P(w_1 w_2 \ldots w_N)$ denotes the probability of occurrence of the phrase's word segment sequence, and $N$ is the number of word segments.
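The computation described for formula (3) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `fluency_score` is hypothetical, and the sequence probability is assumed to come from some language model that is outside this snippet. The value computed this way has the same form as the perplexity of the sequence, so a more probable sequence gives a smaller number.

```python
import math


def fluency_score(sequence_probability: float, num_segments: int) -> float:
    """Return the N-th root of the reciprocal of the sequence probability.

    sequence_probability: P(w_1 ... w_N), assumed to come from a language model.
    num_segments: N, the number of word segments in the phrase.
    """
    if not 0.0 < sequence_probability <= 1.0:
        raise ValueError("sequence_probability must be in (0, 1]")
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    # (1 / P) ** (1 / N), evaluated in log space to stay stable for tiny P.
    return math.exp(-math.log(sequence_probability) / num_segments)
```

Working in log space avoids underflow when the probability of a long sequence is very small.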
In step S206, one or more phrases are selected according to the fluency of each phrase, and the generated phrases are updated to the selected phrases.
For example, the phrase with the highest fluency is selected as the updated generated phrase, or one or more phrases whose fluency is higher than a fluency threshold are selected as the updated generated phrases.
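The selection in steps S202 to S206 can be sketched as follows, under stated assumptions. The function name `keep_most_fluent` and the probability values are hypothetical; in practice the probabilities would come from the pre-trained language model. Because the text states that a higher sequence probability means higher fluency, this sketch ranks the reorderings directly by probability.

```python
from collections import defaultdict


def keep_most_fluent(phrases, sequence_probability):
    """Among phrases made of the same segments in different orders, keep the best one.

    phrases: list of tuples of word segments.
    sequence_probability: callable mapping a phrase tuple to its probability.
    """
    groups = defaultdict(list)
    for phrase in phrases:
        # Phrases with the same segments share one key regardless of order.
        groups[tuple(sorted(phrase))].append(phrase)
    # Keep the most probable ordering of each group.
    return [max(variants, key=sequence_probability) for variants in groups.values()]


# Made-up probabilities, for illustration only.
toy_probabilities = {
    ("春节", "礼盒"): 0.02,
    ("礼盒", "春节"): 0.001,
    ("中秋", "月饼"): 0.03,
}
selected = keep_most_fluent(
    list(toy_probabilities), lambda p: toy_probabilities.get(p, 0.0)
)
```

A threshold-based variant would instead keep every ordering whose score passes the fluency threshold.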
In the method of the above embodiments, selecting phrases with higher fluency further improves the quality and validity rate of the generated phrases.
How the initial phrases in the foregoing embodiments are generated is described below with reference to FIG. 3.
FIG. 3 is a flowchart of still other embodiments of the phrase generation method of the present disclosure. As shown in FIG. 3, the method of these embodiments includes steps S302 to S306.
In step S302, multiple word segments in the training corpus are selected as initial word segments according to the similarity between each word segment in the training corpus and a first seed word segment.
The training corpus may first be preprocessed. The training corpus may consist of various corpora obtained from an Internet platform, or may be obtained from an open-source corpus. Preprocessing mainly includes converting traditional Chinese to simplified Chinese, converting uppercase to lowercase, and deleting special characters (such as @, #, and &). Word segmentation and part-of-speech tagging are then performed on the preprocessed training corpus. The candidate text and the preset text in the foregoing embodiments may undergo a similar preprocessing process, and their word segmentation and part-of-speech tagging may also be performed in advance.
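Two of the preprocessing operations named above can be sketched as follows. This is only an illustration: the function name `preprocess` is hypothetical, the set of special characters is limited to the three given as examples, and traditional-to-simplified conversion is omitted because it needs a conversion table or a dedicated tool.

```python
import re

# Only the characters listed as examples in the text; a real list would be longer.
SPECIAL_CHARS = re.compile(r"[@#&]")


def preprocess(text: str) -> str:
    """Lowercase the text and delete the example special characters."""
    lowered = text.lower()
    return SPECIAL_CHARS.sub("", lowered)
```

Word segmentation and part-of-speech tagging would run on the output of this step.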
An initial phrase may be a phrase representing a preset dimension, and the first seed word segment may likewise be a seed word segment representing the preset dimension. Taking the time dimension as an example, first seed word segments include 妇女节 (Women's Day), 儿童节 (Children's Day), 端午节 (Dragon Boat Festival), 纪念日 (anniversary), and so on.
In some embodiments, the similarity between each word segment in the training corpus and the first seed word segment is determined, and word segments whose similarity is not lower than a word segment similarity threshold are selected as initial word segments. For example, the vector of each word segment in the training corpus and the vector of the first seed word segment are determined; the similarity between each word segment and the first seed word segment is determined from the similarity between their vectors; and the word segments whose similarity to the first seed word segment is not lower than the word segment similarity threshold are selected as initial word segments.
Each word segment in the training corpus and the first seed word segment may be input into a pre-trained word vector model to obtain the vector of each word segment and the vector of the first seed word segment. The word vector model may be an existing model, such as a BERT model, and is not limited to this example. The similarity between the vector of each word segment and the vector of the first seed word segment may be computed as cosine similarity. For example, the following formula is used to compute the similarity between each word segment and the first seed word segment.
$$\mathrm{sim}(s_i, s_j) = \frac{s_i \cdot s_j}{\lVert s_i \rVert \, \lVert s_j \rVert} \qquad (4)$$
In formula (4), $s_i$ denotes the vector of the i-th word segment, i is a positive integer, and $s_j$ denotes the vector of the first seed word segment. For example, word segments similar to the first seed word segment “中秋节” (Mid-Autumn Festival) can be obtained by the above method, as shown in Table 3.
Table 3

Id    Similar word                      Similarity
1     端午节 (Dragon Boat Festival)     0.87729775
2     国庆节 (National Day)             0.79689615
3     重阳节 (Double Ninth Festival)    0.78241974
4     七夕节 (Qixi Festival)            0.73759442
5     …                                 …
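The cosine similarity and threshold selection of step S302 can be sketched as follows. The function names and the toy two-dimensional vectors are hypothetical; real vectors would come from the pre-trained word vector model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def select_initial_segments(vectors, seed, threshold):
    """Return segments whose similarity to the seed is not lower than the threshold.

    vectors: dict mapping each word segment (including the seed) to its vector.
    """
    seed_vector = vectors[seed]
    return [
        segment
        for segment, vector in vectors.items()
        if segment != seed and cosine_similarity(vector, seed_vector) >= threshold
    ]
```

The same two functions apply unchanged to phrase vectors when selecting initial phrases by their similarity to seed phrases.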
In step S304, each initial word segment is combined with a second seed word segment to obtain multiple seed phrases.
There may be multiple second seed word segments, in which case each initial word segment is combined with each second seed word segment to obtain multiple seed phrases.
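Combining every initial word segment with every second seed word segment is a Cartesian product, sketched below. The function name is hypothetical, and placing the initial word segment before the second seed word segment is an assumption modelled on the example “春节礼盒”; the text does not fix the order.

```python
from itertools import product


def build_seed_phrases(initial_segments, second_seed_segments):
    """Pair each initial segment with each second seed segment."""
    return [
        initial + second
        for initial, second in product(initial_segments, second_seed_segments)
    ]
```

With m initial word segments and k second seed word segments this yields m × k seed phrases.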
In step S306, multiple phrases in the training corpus are selected as initial phrases according to the similarity between each phrase in the training corpus and the seed phrases.
In some embodiments, the vector of each phrase in the training corpus and the vector of each seed phrase are determined; the similarity between each phrase in the training corpus and a seed phrase is determined from the similarity between their vectors; and multiple phrases whose similarity to a seed phrase is not lower than a similarity threshold are selected as initial phrases. The vectors of the phrases and of the seed phrases may be determined with the word vector model, and the similarity between each phrase and a seed phrase may be determined in the same or a similar way as the similarity between each word segment and the first seed word segment.
The method of the above embodiments can mine a large number of initial phrases from the training corpus starting from a small number of first seed word segments and second seed word segments. These initial phrases are used for subsequent phrase generation and improve the quality and richness of the generated phrases.
The present disclosure also provides a phrase generation apparatus, which is described below with reference to FIG. 4.
FIG. 4 is a structural diagram of some embodiments of the phrase generation apparatus of the present disclosure. As shown in FIG. 4, the apparatus 40 of these embodiments includes a part-of-speech combination determination module 410, a part-of-speech combination selection module 420, a candidate phrase generation module 430, and a phrase generation module 440.
The part-of-speech combination determination module 410 is configured to determine, for each obtained initial phrase, the part of speech and order of each word segment in the initial phrase to obtain the part-of-speech combination of the initial phrase, where the part-of-speech combination is the parts of speech of the word segments arranged in the order of the word segments.
In some embodiments, the part-of-speech combination determination module 410 is configured to select multiple phrases in the training corpus as initial phrases according to the similarity between each phrase in the training corpus and the seed phrases.
In some embodiments, the part-of-speech combination determination module 410 is configured to determine the vector of each phrase in the training corpus and the vector of each seed phrase; determine the similarity between each phrase in the training corpus and a seed phrase from the similarity between their vectors; and select multiple phrases whose similarity to a seed phrase is not lower than a similarity threshold as initial phrases.
In some embodiments, the part-of-speech combination determination module 410 is configured to select multiple word segments in the training corpus as initial word segments according to the similarity between each word segment in the training corpus and the first seed word segment, and to combine each initial word segment with the second seed word segment to obtain multiple seed phrases.
The part-of-speech combination selection module 420 is configured to select one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination.
In some embodiments, the part-of-speech combination selection module 420 is configured to determine, for each part-of-speech combination, the weight of the combination according to its number of occurrences and the maximum and minimum among the numbers of occurrences of all part-of-speech combinations, and to select one or more part-of-speech combinations whose weight is not lower than a weight threshold.
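The counting and weighting performed by modules 410 and 420 can be sketched as follows. The weight formula is not given in this passage, so min-max normalization is used here purely as an assumption consistent with "number of occurrences, maximum, and minimum". The function name and the part-of-speech tags in any example input are hypothetical.

```python
from collections import Counter


def pos_combination_weights(initial_phrases):
    """Count part-of-speech combinations and scale the counts to [0, 1].

    initial_phrases: list of phrases, each a list of (segment, pos) pairs
    in phrase order.
    """
    counts = Counter(
        tuple(pos for _, pos in phrase) for phrase in initial_phrases
    )
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    if highest == lowest:
        # All combinations are equally frequent; give them the same weight.
        return {combination: 1.0 for combination in counts}
    # Assumed weighting: min-max normalization of the occurrence counts.
    return {
        combination: (count - lowest) / (highest - lowest)
        for combination, count in counts.items()
    }
```

Combinations whose weight is not lower than the weight threshold would then be kept.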
The candidate phrase generation module 430 is configured to screen, from the word segments of the candidate text, the word segments whose parts of speech match those in the selected part-of-speech combinations, and to generate phrases according to the selected part-of-speech combinations as candidate phrases.
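The candidate generation of module 430 can be sketched as follows, under stated assumptions: the function name is hypothetical, the tags "t" and "n" used in any example are illustrative labels, and reusing the same word segment twice within one phrase is excluded as a design choice of this sketch.

```python
from itertools import product


def generate_candidates(tagged_segments, pos_combination):
    """Fill each slot of a part-of-speech combination with matching segments.

    tagged_segments: list of (segment, pos) pairs from the candidate text.
    pos_combination: tuple of parts of speech, in phrase order.
    """
    by_pos = {}
    for segment, pos in tagged_segments:
        bucket = by_pos.setdefault(pos, [])
        if segment not in bucket:  # keep each segment once per part of speech
            bucket.append(segment)
    slots = [by_pos.get(pos, []) for pos in pos_combination]
    # Every way of choosing one segment per slot, without repeating a segment.
    return [
        combo for combo in product(*slots) if len(set(combo)) == len(combo)
    ]
```

When a part of speech appears more than once in a combination, both orderings of the matching word segments are produced, which is the situation the fluency step later resolves.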
The phrase generation module 440 is configured to select candidate phrases as generated phrases according to the degree of closeness of the word segments in each candidate phrase.
In some embodiments, the phrase generation module 440 is configured to determine, for each candidate phrase, the degree of closeness of its word segments according to the number of times each word segment appears in the preset text individually and the number of times the word segments appear consecutively in the preset text, and to select candidate phrases whose degree of closeness is not lower than a closeness threshold as generated phrases.
In some embodiments, for each candidate phrase, the degree of closeness of its word segments is the ratio of the probability that the word segments appear consecutively in the preset text to the product of the probabilities that each word segment appears in the preset text individually.
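The closeness ratio can be sketched from raw counts as follows. Several points are assumptions of this sketch rather than statements of the patent: the function name, the way counts are turned into probabilities, and counting a consecutive occurrence regardless of segment order. The last choice is made because the description says that phrases with the same word segments in different orders have the same degree of closeness.

```python
def closeness(phrase, corpus_tokens):
    """Ratio of the consecutive-occurrence probability to the product of
    the individual occurrence probabilities.

    phrase: tuple of word segments.
    corpus_tokens: the preset text as a list of word segments.
    """
    n = len(phrase)
    total = len(corpus_tokens)
    windows = total - n + 1
    if n == 0 or windows <= 0:
        return 0.0
    target = sorted(phrase)
    # Count windows containing exactly the phrase's segments, in any order.
    joint_count = sum(
        1 for i in range(windows) if sorted(corpus_tokens[i:i + n]) == target
    )
    joint_probability = joint_count / windows
    independent_probability = 1.0
    for segment in phrase:
        independent_probability *= corpus_tokens.count(segment) / total
    if independent_probability == 0.0:
        return 0.0
    return joint_probability / independent_probability
```

A value well above 1 means the word segments occur together more often than their individual frequencies alone would suggest.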
In some embodiments, the phrase generation module 440 is further configured to: when the generated phrases include multiple phrases that contain the same word segments in different orders, determine the probability of occurrence of the word segment sequence of each of these phrases; determine the fluency of each phrase according to the probability of occurrence of its word segment sequence; and select one or more phrases according to the fluency of each phrase and update the generated phrases to the selected phrases.
In some embodiments, the phrase generation module 440 is configured to input the word segment sequence of each phrase into a pre-trained natural language processing model to obtain the probability of occurrence of the word segment sequence of each phrase.
In some embodiments, the phrase generation module 440 is configured to, for each phrase, take the reciprocal of the probability of occurrence of the phrase's word segment sequence and compute the N-th root of that reciprocal, where N is the number of word segments, to obtain the fluency of the phrase.
The phrase generation apparatuses in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, as described below with reference to FIG. 5 and FIG. 6.
FIG. 5 is a structural diagram of some embodiments of the phrase generation apparatus of the present disclosure. As shown in FIG. 5, the apparatus 50 of these embodiments includes a memory 510 and a processor 520 coupled to the memory 510. The processor 520 is configured to execute the phrase generation method of any of the embodiments of the present disclosure based on instructions stored in the memory 510.
The memory 510 may include, for example, a system memory and a fixed non-volatile storage medium. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
FIG. 6 is a structural diagram of other embodiments of the phrase generation apparatus of the present disclosure. As shown in FIG. 6, the apparatus 60 of these embodiments includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520 respectively. The apparatus may further include an input/output interface 630, a network interface 640, a storage interface 650, and the like. The interfaces 630, 640, and 650, the memory 610, and the processor 620 may be connected, for example, through a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networked devices and may, for example, connect to a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
Those skilled in the art will appreciate that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus configured to implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps configured to implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (13)

  1. A phrase generation method, comprising:
    for each obtained initial phrase, determining the part of speech and order of each word segment in the initial phrase to obtain a part-of-speech combination of the initial phrase, wherein the part-of-speech combination is the parts of speech of the word segments arranged in the order of the word segments;
    selecting one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination;
    screening, from the word segments of a candidate text, word segments whose parts of speech match those in the selected part-of-speech combinations, and generating phrases according to the selected part-of-speech combinations as candidate phrases; and
    selecting candidate phrases as generated phrases according to the degree of closeness of the word segments in each candidate phrase.
  2. The phrase generation method according to claim 1, wherein selecting candidate phrases as generated phrases according to the degree of closeness of the word segments in each candidate phrase comprises:
    for each candidate phrase, determining the degree of closeness of the word segments in the candidate phrase according to the number of times each word segment in the candidate phrase appears in a preset text individually and the number of times the word segments appear consecutively in the preset text; and
    selecting candidate phrases whose degree of closeness is not lower than a closeness threshold as the generated phrases.
  3. The phrase generation method according to claim 2, wherein, for each candidate phrase, the degree of closeness of the word segments in the candidate phrase is the ratio of the probability that the word segments appear consecutively in the preset text to the product of the probabilities that each word segment appears in the preset text individually.
  4. The phrase generation method according to claim 1, wherein selecting one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination comprises:
    for each part-of-speech combination, determining the weight of the part-of-speech combination according to the number of occurrences of the part-of-speech combination and the maximum and minimum among the numbers of occurrences of all part-of-speech combinations; and
    selecting one or more part-of-speech combinations whose weight is not lower than a weight threshold.
  5. The phrase generation method according to claim 1, further comprising:
    when the generated phrases include multiple phrases that contain the same word segments in different orders, determining the probability of occurrence of the word segment sequence of each of the multiple phrases;
    determining the fluency of each phrase according to the probability of occurrence of the word segment sequence of the phrase; and
    selecting one or more phrases according to the fluency of each phrase, and updating the generated phrases to the selected phrases.
  6. The phrase generation method according to claim 5, wherein determining the probability of occurrence of the word segment sequence of each of the multiple phrases comprises:
    inputting the word segment sequence of each phrase into a pre-trained natural language processing model to obtain the probability of occurrence of the word segment sequence of each phrase.
  7. The phrase generation method according to claim 5, wherein determining the fluency of each phrase according to the probability of occurrence of the word segment sequence of the phrase comprises:
    for each phrase, taking the reciprocal of the probability of occurrence of the word segment sequence of the phrase and computing the N-th root of the reciprocal, where N is the number of word segments, to obtain the fluency of the phrase.
  8. The phrase generation method according to claim 1, further comprising:
    selecting multiple phrases in a training corpus as initial phrases according to the similarity between each phrase in the training corpus and seed phrases.
  9. The phrase generation method according to claim 8, wherein selecting multiple phrases in the training corpus as initial phrases according to the similarity between each phrase in the training corpus and the seed phrases comprises:
    determining the vector of each phrase in the training corpus and the vector of each seed phrase;
    determining the similarity between each phrase in the training corpus and the seed phrase according to the similarity between the vector of the phrase and the vector of the seed phrase; and
    selecting multiple phrases whose similarity to the seed phrase is not lower than a similarity threshold as the initial phrases.
  10. The phrase generation method according to claim 8, further comprising:
    selecting multiple word segments in the training corpus as initial word segments according to the similarity between each word segment in the training corpus and a first seed word segment; and
    combining each initial word segment with a second seed word segment to obtain multiple seed phrases.
  11. A phrase generation apparatus, comprising:
    a part-of-speech combination determination module, configured to determine, for each obtained initial phrase, the part of speech and order of each word segment in the initial phrase to obtain a part-of-speech combination of the initial phrase, wherein the part-of-speech combination is the parts of speech of the word segments arranged in the order of the word segments;
    a part-of-speech combination selection module, configured to select one or more part-of-speech combinations according to the number of occurrences of each part-of-speech combination;
    a candidate phrase generation module, configured to screen, from the word segments of a candidate text, word segments whose parts of speech match those in the selected part-of-speech combinations, and to generate phrases according to the selected part-of-speech combinations as candidate phrases; and
    a phrase generation module, configured to select candidate phrases as generated phrases according to the degree of closeness of the word segments in each candidate phrase.
  12. A phrase generation apparatus, comprising:
    a processor; and
    a memory coupled to the processor and configured to store instructions which, when executed by the processor, cause the processor to perform the phrase generation method according to any one of claims 1-10.
  13. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-10.
PCT/CN2022/077155 2021-03-03 2022-02-22 Phrase generation method and apparatus, and computer readable storage medium WO2022183923A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110234468.7 2021-03-03
CN202110234468.7A CN113761114A (en) 2021-03-03 2021-03-03 Phrase generation method and device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022183923A1 true WO2022183923A1 (en) 2022-09-09

Family

ID=78786698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077155 WO2022183923A1 (en) 2021-03-03 2022-02-22 Phrase generation method and apparatus, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN113761114A (en)
WO (1) WO2022183923A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976320A (en) * 2023-09-22 2023-10-31 湖南财信数字科技有限公司 Mechanism short extraction method, device, computer equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761114A (en) * 2021-03-03 2021-12-07 北京沃东天骏信息技术有限公司 Phrase generation method and device and computer-readable storage medium
CN114818655A (en) * 2022-05-13 2022-07-29 平安科技(深圳)有限公司 Random text generation method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09330331A (en) * 1996-06-10 1997-12-22 Nippon Telegr & Teleph Corp <Ntt> Phrase detecting method
CN108108379A (en) * 2016-11-25 2018-06-01 北京国双科技有限公司 Keyword opens up the method and device of word
US20180246872A1 (en) * 2017-02-28 2018-08-30 Nice Ltd. System and method for automatic key phrase extraction rule generation
CN110162781A (en) * 2019-04-09 2019-08-23 国金涌富资产管理有限公司 A kind of finance text subjectivity sentence automatic identifying method
CN111783450A (en) * 2020-06-29 2020-10-16 中国平安人寿保险股份有限公司 Phrase extraction method and device in corpus text, storage medium and electronic equipment
CN113761114A (en) * 2021-03-03 2021-12-07 北京沃东天骏信息技术有限公司 Phrase generation method and device and computer-readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976320A (en) * 2023-09-22 2023-10-31 湖南财信数字科技有限公司 Mechanism short extraction method, device, computer equipment and storage medium
CN116976320B (en) * 2023-09-22 2023-12-15 湖南财信数字科技有限公司 Mechanism short extraction method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113761114A (en) 2021-12-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22762389

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE