CN117172244A - Corpus aggregation method, storage medium and electronic equipment - Google Patents

Info

Publication number
CN117172244A
Authority
CN
China
Prior art keywords
core
corpus
target
speech
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210589136.5A
Other languages
Chinese (zh)
Inventor
雷丽莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202210589136.5A priority Critical patent/CN117172244A/en
Publication of CN117172244A publication Critical patent/CN117172244A/en
Pending legal-status Critical Current

Abstract

The invention discloses a corpus aggregation method, a storage medium and electronic equipment. The method relates to the technical field of language processing and comprises the following steps: segmenting each of a plurality of corpora according to a target word stock, and labeling the part of speech of each phrase obtained by segmentation; determining the core phrase of each corpus according to the part of speech of each phrase contained in that corpus; sorting and grouping the plurality of corpora based on the core phrases to obtain a plurality of core corpus sets, wherein each core corpus set comprises a plurality of core corpora containing a target core phrase; and, in each core corpus set, sorting and grouping the plurality of core corpora based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets, wherein each target corpus set comprises at least one target corpus containing the target core phrase and a target part-of-speech phrase. The invention solves the technical problem of low corpus processing efficiency caused by analyzing corpora one at a time.

Description

Corpus aggregation method, storage medium and electronic equipment
Technical Field
The present invention relates to the field of language processing, and in particular, to a corpus aggregation method, a storage medium, and an electronic device.
Background
In syntactic analysis of corpora, the basic task is to determine the syntactic structure of a sentence and the dependency relationships between the words it contains. Existing syntactic analysis mainly analyzes the parts of speech, syntax and dependency relationships of a single sentence; it cannot perform unified syntactic analysis on a corpus set from a given field in order to find the syntactic commonality of the corpora in that field.
If corpora in the same field are classified and aggregated, unified syntactic analysis and field labeling can be performed on multiple corpora in a corpus set, which can greatly improve corpus processing efficiency. In the prior art, only a single corpus can be syntactically analyzed and corpus aggregation cannot be realized, so processing such as corpus analysis and labeling is very inefficient.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a corpus aggregation method, a storage medium and electronic equipment, which are used for at least solving the technical problem of low corpus processing efficiency caused by analyzing a single corpus.
According to an aspect of the embodiment of the present invention, there is provided a corpus aggregation method, including: segmenting each of a plurality of corpora according to a target word stock, and labeling the part of speech of each phrase obtained by segmentation, wherein the target word stock comprises preset word segmentation phrases in a target field; determining the core phrase of each corpus according to the part of speech of each phrase contained in that corpus; sorting and grouping the plurality of corpora based on the core phrases to obtain a plurality of core corpus sets, wherein each core corpus set comprises a plurality of core corpora containing a target core phrase; and, in each core corpus set, sorting and grouping the plurality of core corpora based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets for corpus analysis, wherein each target corpus set comprises at least one target corpus containing the target core phrase and a target part-of-speech phrase.
According to another aspect of the embodiment of the present invention, there is also provided a corpus aggregating apparatus, including: a part-of-speech unit, used for segmenting each of a plurality of corpora according to a target word stock and labeling the part of speech of each phrase obtained by segmentation, wherein the target word stock comprises preset word segmentation phrases in a target field; a determining unit, used for determining the core phrase of each corpus according to the part of speech of each phrase contained in that corpus; a core unit, used for sorting and grouping the plurality of corpora based on the core phrases to obtain a plurality of core corpus sets, wherein each core corpus set comprises a plurality of core corpora containing a target core phrase; and a gathering unit, used for sorting and grouping, in each core corpus set, the plurality of core corpora based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets for corpus analysis, wherein each target corpus set comprises at least one target corpus containing the target core phrase and a target part-of-speech phrase.
According to yet another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described corpus aggregation method when run.
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, and a processor, where the memory stores a computer program, and the processor is configured to execute the corpus aggregation method described above by using the computer program.
In the embodiment of the invention, a plurality of corpora are segmented according to a target word stock comprising preset word segmentation phrases in a target field, and each phrase obtained by segmentation is labeled with its part of speech. The core phrase of each corpus is determined according to the part of speech of each phrase contained in that corpus, and the corpora are sorted and grouped based on the core phrases to obtain a plurality of core corpus sets, each comprising a plurality of core corpora containing a target core phrase. In each core corpus set, the core corpora are then sorted and grouped based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets, each comprising at least one target corpus containing the target core phrase and a target part-of-speech phrase. Because the corpora are first aggregated by core phrase and then further aggregated by part-of-speech priority, the resulting target corpus sets can be analyzed as a whole. This achieves the technical effect of improving corpus processing efficiency and solves the technical problem of low corpus processing efficiency caused by analyzing a single corpus.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative corpus aggregation method according to an embodiment of the application;
FIG. 2 is a flow chart of an alternative corpus aggregation method according to an embodiment of the application;
FIG. 3 is a flow chart of an alternative corpus aggregation method according to an embodiment of the application;
FIG. 4 is a flow chart of an alternative corpus aggregation method according to an embodiment of the application;
FIG. 5 is a flow chart of an alternative corpus aggregation method according to an embodiment of the application;
FIG. 6 is an aggregation diagram of an alternative corpus aggregation method according to an embodiment of the application;
FIG. 7 is a schematic structural diagram of an alternative corpus aggregating device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiment of the invention, a corpus aggregation method is provided. The method is widely applicable to whole-house intelligent digital control scenarios such as the smart home (Smart Home), the smart home device ecosystem, and the intelligent house (Intelligence House) ecosystem. Optionally, in this embodiment, the corpus aggregation method may be applied to a hardware environment constituted by the terminal device 102 and the server 104 as shown in FIG. 1. As shown in FIG. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or for a client installed on the terminal. A database may be set on the server or independently of the server to provide data storage services for the server 104, and cloud computing and/or edge computing services may be configured on the server or independently of the server to provide data computing services for the server 104.
The terminal device 102 may, but is not limited to, report device data to the server 104 via the network. The device data may include, but is not limited to, speech control corpus data, such as control instructions, acquired by the terminal device 102. The server 104 may, but is not limited to, analyze the speech control corpus in order to update the operating services of the terminal device 102, and this analysis may, but is not limited to, apply the corpus aggregation method described above:
S102, segmenting the corpora and labeling the parts of speech of the phrases. Each of the plurality of corpora is segmented according to a target word stock, and the part of speech of each phrase obtained by segmentation is labeled, wherein the target word stock comprises preset word segmentation phrases in a target field.
S104, determining the core phrase of each corpus. The core phrase of each corpus is determined according to the part of speech of each phrase contained in that corpus.
S106, obtaining a plurality of core corpus sets. The plurality of corpora are sorted and grouped based on the core phrases to obtain a plurality of core corpus sets, wherein each core corpus set comprises a plurality of core corpora containing a target core phrase.
S108, obtaining a plurality of target corpus sets. In each core corpus set, the plurality of core corpora are sorted and grouped based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets for corpus analysis, wherein each target corpus set comprises at least one target corpus containing the target core phrase and a target part-of-speech phrase.
The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The network may include, but is not limited to, at least one of: a wired network and a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, and a local area network. The wireless network may include, but is not limited to, at least one of: Wi-Fi (Wireless Fidelity) and Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, an intelligent air conditioner, an intelligent range hood, an intelligent refrigerator, an intelligent oven, an intelligent stove, an intelligent washing machine, an intelligent water heater, an intelligent washing device, an intelligent dish washer, an intelligent projection device, an intelligent television, an intelligent clothes hanger, an intelligent curtain, an intelligent video, an intelligent socket, an intelligent sound box, an intelligent fresh air device, an intelligent kitchen and toilet device, an intelligent bathroom device, an intelligent floor sweeping robot, an intelligent window cleaning robot, an intelligent mopping robot, an intelligent air purifying device, an intelligent steam box, an intelligent microwave oven, an intelligent kitchen appliance, an intelligent purifier, an intelligent drinking fountain, an intelligent door lock, and the like, which is not limited in this embodiment.
As an optional embodiment, as shown in fig. 2, the method for aggregating the corpus includes:
S202, segmenting each of a plurality of corpora according to a target word stock, and labeling the part of speech of each phrase obtained by segmentation, wherein the target word stock comprises preset word segmentation phrases in a target field;
S204, determining the core phrase of each corpus according to the part of speech of each phrase contained in that corpus;
S206, sorting and grouping the plurality of corpora based on the core phrases to obtain a plurality of core corpus sets, wherein each core corpus set comprises a plurality of core corpora containing a target core phrase;
S208, in each core corpus set, sorting and grouping the plurality of core corpora based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets for corpus analysis, wherein each target corpus set comprises at least one target corpus containing the target core phrase and a target part-of-speech phrase.
The target word stock may include, but is not limited to, a plurality of preset word segmentation phrases in the target field. When a corpus is segmented, each preset word segmentation phrase that appears in the corpus is kept whole, which prevents it from being split during segmentation. The target word stock may also include general word segmentation phrases from outside the target field, so that segmenting the corpora with the target word stock yields more accurate results.
Word segmentation of the corpus may use, but is not limited to, barker word segmentation: the corpus is divided into phrases, and each phrase is labeled with its part of speech. Parts of speech include, but are not limited to, verbs, nouns, adjectives and prepositions, and may also include custom-defined parts of speech such as, but not limited to, dimension words, device component words, device words, and dimension value words.
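The role of the target word stock in segmentation can be illustrated with a minimal sketch. It is not the segmenter used in the embodiment: it assumes whitespace-separated English tokens as a stand-in for Chinese text, a greedy longest-match strategy, and a hypothetical word stock (`TARGET_LEXICON`) whose entries and tag names are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical target word stock: preset phrase -> part-of-speech tag.
# Both the phrases and the tag names are assumptions for illustration.
TARGET_LEXICON: Dict[str, str] = {
    "set": "verb",
    "humidity control": "dimension",
    "air conditioner": "device",
    "to": "preposition",
    "high": "dimension_value",
}


def segment_and_tag(tokens: List[str], lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Greedy longest match against the word stock.

    Preset phrases are kept whole; tokens not in the word stock are
    emitted one by one with the placeholder tag "other".
    """
    max_len = max((len(p.split()) for p in lexicon), default=1)
    result: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first so multi-word entries win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                result.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            result.append((tokens[i], "other"))
            i += 1
    return result


tagged = segment_and_tag("set air conditioner humidity control to high".split(), TARGET_LEXICON)
```

Here "air conditioner" and "humidity control" come out as single phrases rather than being split, which is the behavior the target word stock is meant to guarantee.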
After part-of-speech tagging is performed on each phrase in a corpus, the core phrase of the corpus may be, but is not limited to being, determined based on the part of speech of each phrase. Specifically, a phrase having a preset part of speech may be determined as the core phrase of the corpus, or the core phrase may be determined based on a core word stock associated with the preset part of speech.
After the core phrase of each corpus is determined, sorting and grouping the corpora based on the core phrases may include, but is not limited to, first sorting the corpora by core phrase and then grouping them according to that order to obtain a plurality of core corpus sets. The core corpora in each core corpus set share the same core phrase, while different core corpus sets may have different core phrases.
After the core corpus sets are obtained, the core corpora in each core corpus set may be, but are not limited to being, re-sorted and regrouped based on target part-of-speech phrases, in order of part-of-speech priority, to obtain a plurality of more specific target corpus sets, so that target corpora containing both the target core phrase and the target part-of-speech phrase are gathered in the same target corpus set.
In the embodiment of the invention, a plurality of corpora are segmented according to a target word stock comprising preset word segmentation phrases in a target field, and each phrase obtained by segmentation is labeled with its part of speech. The core phrase of each corpus is determined according to the part of speech of each phrase contained in that corpus, and the corpora are sorted and grouped based on the core phrases to obtain a plurality of core corpus sets, each comprising a plurality of core corpora containing a target core phrase. In each core corpus set, the core corpora are then sorted and grouped based on the target part-of-speech phrases in the core corpora, in order of part-of-speech priority, to obtain a plurality of target corpus sets, each comprising at least one target corpus containing the target core phrase and a target part-of-speech phrase. Because the corpora are first aggregated by core phrase and then further aggregated by part-of-speech priority, the resulting target corpus sets can be analyzed as a whole. This achieves the technical effect of improving corpus processing efficiency and solves the technical problem of low corpus processing efficiency caused by analyzing a single corpus.
As an alternative embodiment, as shown in fig. 3, determining the core phrase of each corpus according to the part of speech of each phrase included in each corpus includes:
S302, obtaining a core word stock, wherein the core word stock comprises a plurality of candidate core phrases having the core part of speech;
S304, in the case that the corpus includes a candidate core phrase, determining the candidate core phrase in the corpus as the core phrase;
S306, in the case that the corpus does not include a candidate core phrase, determining a phrase having the core part of speech in the corpus as the core phrase.
When a candidate core phrase is included in the corpus, that candidate core phrase is determined as the core phrase of the corpus. When the corpus does not include a candidate core phrase, a phrase having the core part of speech in the corpus is determined as the core phrase.
Determining a phrase having the core part of speech as the core phrase may be, but is not limited to being, based on the number and positions of such phrases in the corpus. As an alternative embodiment, determining the core phrase of each corpus according to the part of speech of each phrase contained in each corpus includes:
S1, in the case that the corpus contains more than one phrase having the core part of speech, determining the one located at a target position as the core phrase;
S2, in the case that the corpus contains no phrase having the core part of speech, determining the phrase having a candidate core part of speech that is located at the target position as the core phrase of the corpus.
When the number of phrases having the core part of speech is greater than 1, the one located at the target position among them is determined as the core phrase.
When there is exactly one phrase having the core part of speech, it may be directly determined as the core phrase.
When the corpus contains no phrase having the core part of speech, a phrase having a candidate core part of speech is determined as the core phrase. If there is more than one such phrase, the one located at the target position among them may be determined as the core phrase.
Taking the verb as the core part of speech as an example, it is first determined whether the corpus contains a candidate verb from the core word stock; if so, that candidate verb is directly determined as the core verb of the corpus. If not, the phrases with verb part of speech in the corpus are identified: if there is only one, it is directly determined as the core phrase; if there is more than one, then, taking the last position as the target position as an example, the last verb phrase in the corpus is determined as the core phrase. If the corpus contains no verb phrase, then, taking the adjective as the candidate core part of speech as an example, the last adjective phrase in the corpus is determined as the core phrase.
Alternatively, the phrases having the core part of speech may first be obtained from the corpus, and when there is more than one, the core phrase is selected from among them using the core word stock. Again taking the verb as the core part of speech, and a corpus concerning the ultraviolet sterilization mode as an example: the corpus contains the three verbs "change", "form" and "divide", which are looked up in the core word stock, and if "change" is found it is determined as the core phrase. If none of the three verbs is found in the core word stock, the verb located at the target position is determined as the core verb; taking the last verb as the target position, "divide" is determined as the core verb.
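The selection rules above can be sketched as a small function. This is an illustrative reading, not the embodiment's implementation: it assumes the verb as the core part of speech, the adjective as the only candidate core part of speech, the last position as the target position, and, where the text is silent, that the first word-stock hit wins when several are present. The function name and tag strings are hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple

Tagged = List[Tuple[str, str]]  # (phrase, part of speech), in sentence order


def pick_core_phrase(
    tagged: Tagged,
    core_lexicon: Set[str],
    core_pos: str = "verb",
    fallback_pos: Sequence[str] = ("adjective",),
) -> Optional[str]:
    """Return the core phrase of one tagged corpus, or None if no rule applies."""
    # Rule 1: a phrase found in the core word stock is taken directly.
    # Assumption: if several are present, the first one is used.
    for phrase, _ in tagged:
        if phrase in core_lexicon:
            return phrase
    # Rule 2: otherwise use phrases with the core part of speech, then the
    # candidate parts of speech; the target position is assumed to be "last".
    for pos in (core_pos, *fallback_pos):
        matches = [phrase for phrase, tag in tagged if tag == pos]
        if matches:
            return matches[-1]
    return None


verbs = [("change", "verb"), ("form", "verb"), ("divide", "verb")]
with_lexicon = pick_core_phrase(verbs, {"change"})
without_lexicon = pick_core_phrase(verbs, set())
```

With "change" in the core word stock the function returns "change"; with an empty word stock it falls back to the last verb, "divide", matching the example in the text.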
As an alternative embodiment, as shown in fig. 4, sorting and grouping the multiple corpora based on the core phrases, to obtain multiple core corpus sets includes:
S402, sorting the plurality of corpora according to the target information of the core phrase in each corpus to obtain a first corpus sequence;
S404, determining a plurality of target core phrases in the first corpus sequence, wherein, for each target core phrase, the number of corpora among the plurality of corpora that contain it is greater than a core threshold;
S406, grouping the plurality of corpora in the first corpus sequence according to the target core phrases to obtain a plurality of core corpus sets.
The target information may be, but is not limited to, a parameter of the core phrase such as its pinyin, initial letters, strokes, or radicals. The plurality of corpora are sorted according to the target information of their core phrases to obtain the first corpus sequence.
A plurality of target core phrases are determined in the first corpus sequence, and the corpora in the first corpus sequence are grouped according to the target core phrases to obtain a plurality of core corpus sets. A target core phrase may be, but is not limited to, a core phrase whose number of corresponding corpora is greater than the core threshold. For example, if the number of corpora containing the core verb "change" is greater than the core threshold, "change" is determined as a target core phrase, and the corpora containing the target core phrase "change" are divided into one core corpus set.
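Steps S402 to S406 can be sketched as follows. This is a minimal illustration under stated assumptions: each corpus arrives as a hypothetical `(sentence, core_phrase)` pair, and plain string order of the core phrase stands in for the target information (pinyin, initials, strokes and so on).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_core(
    corpora: List[Tuple[str, str]], core_threshold: int
) -> Dict[str, List[str]]:
    """Sort corpora by core phrase, then keep groups larger than the threshold."""
    # S402: first corpus sequence. sorted() is stable, so corpora that share
    # a core phrase keep their original relative order.
    first_sequence = sorted(corpora, key=lambda item: item[1])
    # S406: group the sequence by core phrase.
    buckets: Dict[str, List[str]] = defaultdict(list)
    for sentence, core in first_sequence:
        buckets[core].append(sentence)
    # S404: a core phrase is a target core phrase only if the number of
    # corpora containing it is greater than the core threshold.
    return {core: group for core, group in buckets.items() if len(group) > core_threshold}


corpora = [
    ("change to cooling", "change"),
    ("open the door", "open"),
    ("change the mode", "change"),
    ("adjust humidity", "adjust"),
    ("change fan speed", "change"),
]
core_sets = group_by_core(corpora, core_threshold=2)
```

With a core threshold of 2, only "change" qualifies as a target core phrase, so the result contains a single core corpus set with the three "change" corpora.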
As an optional implementation manner, in each core corpus, sorting and grouping the plurality of core corpora based on the target part-of-speech phrases in the core corpora in turn according to part-of-speech priority, so as to obtain a plurality of target corpus sets includes:
S1, determining a first text and a second text of a core corpus based on the target core phrase, wherein the first text and the second text are located on the two sides of the target core phrase in the core corpus;
S2, sorting the core corpora based on the target part-of-speech phrases in the first text and the target part-of-speech phrases in the second text according to the part-of-speech priority, and grouping the core corpora based on the target part-of-speech phrases to obtain a plurality of target corpus sets.
As an alternative embodiment, determining the first text and the second text of the core corpus based on the target core phrase includes: determining the text located on the left side of the target core phrase in the core corpus as the first text; and determining the text located on the right side of the target core phrase in the core corpus as the second text.
The core corpora are sorted by the target part-of-speech phrases in the left text and in the right text according to the part-of-speech priority, and the sorted core corpora are then grouped again to obtain the target corpus sets.
The part-of-speech priority is a priority ordering of parts of speech. The core corpora are sorted within the first text and the second text in turn according to this priority, and grouping is performed on the sorted core corpora to obtain the target corpus sets.
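The split into first text and second text can be sketched as below. The sketch assumes a corpus is represented as a list of hypothetical `(phrase, part_of_speech)` pairs and, since the text does not say what happens when the core phrase occurs more than once, splits at its first occurrence.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (phrase, part of speech), in sentence order


def split_around_core(tagged: Tagged, core: str) -> Tuple[Tagged, Tagged]:
    """Return (first_text, second_text): phrases left and right of the core phrase."""
    for index, (phrase, _) in enumerate(tagged):
        if phrase == core:
            # Assumption: split at the first occurrence of the core phrase.
            return tagged[:index], tagged[index + 1:]
    raise ValueError(f"core phrase {core!r} not found in corpus")


corpus = [
    ("air conditioner", "device"),
    ("change", "verb"),
    ("humidity control", "dimension"),
    ("high", "dimension_value"),
]
first_text, second_text = split_around_core(corpus, "change")
```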
As an alternative embodiment, as shown in fig. 5, the sorting the core corpus according to the part-of-speech priority based on the target part-of-speech phrase in the first text and the target part-of-speech phrase in the second text includes:
S502, determining the current target part of speech corresponding to the core corpus set in the first text according to the part-of-speech priority;
S504, determining the current target part-of-speech phrase, having the current target part of speech, contained in the first text of each core corpus in the core corpus set;
S506, sorting the core corpora according to the target information of the current target part-of-speech phrase in each core corpus to obtain a second corpus sequence;
S508, determining the current target part of speech corresponding to the core corpus set in the second text according to the part-of-speech priority;
S510, determining the current target part-of-speech phrase, having the current target part of speech, contained in the second text of each core corpus in the core corpus set;
S512, adjusting the second corpus sequence according to the target information of the current target part-of-speech phrase in each core corpus to obtain a third corpus sequence.
The processing order of the first text and the second text is not limited: the first text may be processed first and then the second, or the second text first and then the first. The processing flows for the first text and the second text may be, but are not limited to being, the same.
Taking processing the first text before the second text as an example, the second corpus sequence is obtained by adjusting the order of the core corpora in the core corpus set, one part of speech at a time, according to the part-of-speech priority. Taking the part-of-speech priority core verb > dimension > device component > device > dimension value > adjective as an example, the core corpora are sorted in turn by the target part-of-speech phrases in the first text according to this priority to obtain the second corpus sequence. The order of the core corpora is then adjusted by the target part-of-speech phrases in the second text, again according to the part-of-speech priority, to obtain the third corpus sequence.
When a core corpus does not contain a phrase of the current target part of speech, it is placed at the end of the ordering for that part of speech. Phrases whose part of speech is not in the part-of-speech priority are not used as a basis for sorting or adjusting the corpora; the corpora are sorted and adjusted only by the parts of speech in the priority, in order.
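One way to read the priority-based ordering is as a multi-key sort, sketched below. This interpretation is an assumption: the text describes successive sort-and-adjust passes, and the sketch models them as a single sort whose key lists the first-text parts of speech in priority order followed by the second-text parts of speech. The tag names in `PRIORITY` are hypothetical, plain string order stands in for the target information, and only the first phrase of each part of speech in a text is considered.

```python
from typing import List, Sequence, Tuple

Tagged = List[Tuple[str, str]]  # (phrase, part of speech)

# Hypothetical tag names for the example priority in the text:
# core verb > dimension > device component > device > dimension value > adjective
PRIORITY = ("verb", "dimension", "device_component", "device", "dimension_value", "adjective")


def pos_key(text: Tagged, priority: Sequence[str]) -> Tuple[Tuple[int, str], ...]:
    """Build one sort-key component per part of speech in the priority."""
    key = []
    for pos in priority:
        phrase = next((p for p, tag in text if tag == pos), None)
        # A corpus lacking this part of speech sorts after those that have it.
        key.append((1, "") if phrase is None else (0, phrase))
    return tuple(key)


def sort_core_corpora(
    corpora: List[Tuple[Tagged, Tagged]], priority: Sequence[str] = PRIORITY
) -> List[Tuple[Tagged, Tagged]]:
    """Sort (first_text, second_text) pairs: first-text keys, then second-text keys."""
    return sorted(corpora, key=lambda c: pos_key(c[0], priority) + pos_key(c[1], priority))


a = ([], [("temperature", "dimension")])
b = ([], [("humidity", "dimension")])
c = ([], [("lamp", "device")])
third_sequence = sort_core_corpora([a, c, b])
```

In this example the two corpora with a dimension phrase come first, ordered by that phrase, and the corpus without a dimension phrase is placed last. Parts of speech outside `PRIORITY` never enter the key, so they do not affect the order.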
As an optional implementation, grouping the core corpora based on the target part-of-speech phrases to obtain a plurality of target corpus sets includes: when a target part-of-speech phrase is determined in the third corpus sequence, dividing the core corpora containing that target part-of-speech phrase into one target corpus set, wherein the number of core corpora in the core corpus set that contain the target part-of-speech phrase is greater than a target threshold.
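The grouping of an already sorted sequence can be sketched with consecutive runs. The sketch assumes the third corpus sequence is given as hypothetical `(sentence, target_phrase)` pairs in sorted order, with `None` marking a corpus that has no target part-of-speech phrase; such corpora are left ungrouped here, which is a choice the text does not spell out.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple


def target_corpus_sets(
    third_sequence: List[Tuple[str, Optional[str]]], target_threshold: int
) -> Dict[str, List[str]]:
    """Collect consecutive runs that share a target phrase and exceed the threshold."""
    sets: Dict[str, List[str]] = {}
    # groupby only merges adjacent items, which is sufficient because the
    # sequence is assumed to be sorted by target phrase already.
    for phrase, run in groupby(third_sequence, key=lambda item: item[1]):
        sentences = [sentence for sentence, _ in run]
        if phrase is not None and len(sentences) > target_threshold:
            sets[phrase] = sentences
    return sets


sequence = [
    ("change humidity control to high", "humidity control"),
    ("change humidity control to low", "humidity control"),
    ("change temperature to 26", "temperature"),
    ("change it", None),
]
result = target_corpus_sets(sequence, target_threshold=1)
```

With a target threshold of 1, only the two "humidity control" corpora form a target corpus set; the single "temperature" corpus does not exceed the threshold.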
Take as an example the case in which the verb is selected as the core part of speech, the target information is the initial letter of the phrase, the sort order is A-Z, and the part-of-speech priority is verb > dimension > device component > device > dimension value > adjective; the resulting ordering may be, but is not limited to, that shown in fig. 6. First, the verb phrases are traversed and used as core phrases, and the corpora are sorted by the pinyin letters A-Z of the phrases. The corpora are then divided into core corpus sets by core phrase; for example, the corpora that contain the core phrase 'regulation' form one core corpus set. Next, the text on the right side of the core verb in each core corpus set is traversed, and the core corpora are sorted by the pinyin A-Z of the phrases with the dimension part of speech. Similarly, the order of the core corpora is adjusted by the pinyin A-Z of the phrases with the device part of speech, the dimension value part of speech, and the adjective part of speech in turn; a corpus that does not include such a phrase is placed at the end. The text on the left side of the core verb is likewise sorted in sequence according to the priority verb > dimension > device component > device > dimension value > adjective, yielding the ordering illustrated in fig. 6. With this corpus aggregation method, the corpora containing 'changed' are aggregated, and within the 'changed' core corpus set the corpora containing 'moisture-controlled' are aggregated again. Corpora with similar syntax and similar components are thus gathered together, which makes it convenient to label and analyze them and improves corpus processing efficiency.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
According to another aspect of the embodiment of the invention, there is also provided a corpus aggregating device for implementing the corpus aggregating method. As shown in fig. 7, the apparatus includes:
the part-of-speech unit 702 is configured to segment a plurality of corpus according to a target word stock, and label the part-of-speech for each word group obtained by the segmentation, where the target word stock includes a preset word group in a target field;
a determining unit 704, configured to determine a core phrase of each corpus according to the part of speech of each phrase included in each corpus;
a core unit 706, configured to sort and group the plurality of corpora based on the core phrases, to obtain a plurality of core corpus sets, where the core corpus sets include a plurality of core corpora including the target core phrases;
an aggregation unit 708, configured to, in each core corpus set, sort and group the plurality of core corpora in sequence according to the part-of-speech priority, based on the target part-of-speech phrases in the core corpora, so as to obtain a plurality of target corpus sets for corpus analysis, where each target corpus set includes at least one target corpus containing the target core phrase and the target part-of-speech phrase.
Optionally, the determining unit 704 includes a candidate module, configured to obtain a core word stock, where the core word stock includes a plurality of candidate core phrases with the core part of speech; when the corpus includes a candidate core phrase, determine the candidate core phrase in the corpus as the core phrase; and when the corpus does not include any candidate core phrase, determine the phrase with the core part of speech in the corpus as the core phrase.
Optionally, the determining unit 704 includes a core module, configured to: when the corpus contains more than one phrase with the core part of speech, determine the phrase with the core part of speech located at the target position as the core phrase; and when the corpus does not include a phrase with the core part of speech, determine the phrase with a candidate core part of speech located at the target position in the corpus as the core phrase.
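The selection logic of the candidate module and the core module might look like the following sketch. Several points are assumptions rather than statements from the description: the function name `pick_core_phrase`, the `(phrase, part_of_speech)` representation, the choice of the first occurrence as the "target position", and taking the first match when several candidate core phrases appear. The fallback to a candidate core part of speech is not modeled.

```python
from typing import List, Optional, Set, Tuple

Phrase = Tuple[str, str]   # (phrase text, part of speech); hypothetical representation
Corpus = List[Phrase]


def pick_core_phrase(corpus: Corpus,
                     core_word_stock: Set[str],
                     core_pos: str = "verb") -> Optional[str]:
    """Choose the core phrase of one segmented, tagged corpus.

    1. A phrase found in the core word stock takes precedence.
    2. Otherwise, use a phrase with the core part of speech; if there are
       several, the one at the target position (assumed: the first) wins.
    3. Return None when neither rule applies.
    """
    stock_hits = [text for text, _ in corpus if text in core_word_stock]
    if stock_hits:
        return stock_hits[0]          # assumption: first match is used

    pos_hits = [text for text, tag in corpus if tag == core_pos]
    if pos_hits:
        return pos_hits[0]            # assumption: target position = first
    return None
```

For a corpus tagged as `[("please", "other"), ("adjust", "verb"), ("open", "verb")]`, a core word stock containing `"open"` selects `"open"`, while an empty core word stock falls back to the first verb, `"adjust"`.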
Optionally, the core unit 706 includes a grouping module, configured to sort the plurality of corpora according to the target information of the core phrase in each corpus, so as to obtain a first corpus sequence; determine a plurality of target core phrases in the first corpus sequence, where the number of corpora, among the plurality of corpora, that include a target core phrase is larger than a core threshold; and group the plurality of corpora in the first corpus sequence according to the target core phrases to obtain a plurality of core corpus sets.
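As a rough sketch of the grouping module, the following Python sorts corpora by core phrase and keeps only the groups that exceed the core threshold. The input format (pairs of core phrase and corpus text), the function name `build_core_sets`, and the use of the full core phrase string as the sort key are assumptions; the description only says the corpora are sorted by "target information" such as the initial letter.

```python
from itertools import groupby
from typing import Dict, List, Tuple

TaggedCorpus = Tuple[str, str]   # (core phrase, corpus text); hypothetical representation


def build_core_sets(tagged: List[TaggedCorpus],
                    core_threshold: int
                    ) -> Tuple[List[TaggedCorpus], Dict[str, List[str]]]:
    """Return the first corpus sequence and the core corpus sets.

    The first corpus sequence is the input sorted by core phrase
    (stable, so ties keep their input order). A core phrase becomes a
    target core phrase only if more than `core_threshold` corpora
    contain it.
    """
    first_sequence = sorted(tagged, key=lambda item: item[0])
    core_sets: Dict[str, List[str]] = {}
    for core, run in groupby(first_sequence, key=lambda item: item[0]):
        members = [text for _, text in run]
        if len(members) > core_threshold:
            core_sets[core] = members
    return first_sequence, core_sets
```

Because the sequence is sorted before `groupby` is applied, all corpora sharing a core phrase are adjacent and fall into one group.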
Optionally, the aggregating unit 708 includes a dividing module, configured to determine a first text and a second text of the core corpus based on the target core phrase, where the first text and the second text are located on two sides of the target core phrase in the core corpus respectively; and sorting the core corpus based on the target part-of-speech phrase in the first text and the target part-of-speech phrase in the second text according to the part-of-speech priority, and grouping the core corpus based on the target part-of-speech phrase to obtain a plurality of target corpus sets.
Optionally, the above-mentioned dividing module is further configured to determine, as the first text, a text located at the left side of the target core phrase in the core corpus; and determining the text positioned on the right side of the target core phrase in the core corpus as a second text.
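The left/right division can be illustrated with a short sketch. The `(phrase, part_of_speech)` representation and the function name `split_around_core` are hypothetical, and splitting at the first occurrence of the core phrase is an assumption.

```python
from typing import List, Tuple

Phrase = Tuple[str, str]   # (phrase text, part of speech); hypothetical representation
Corpus = List[Phrase]


def split_around_core(corpus: Corpus, core_phrase: str) -> Tuple[Corpus, Corpus]:
    """Split a core corpus into the first text (left of the target core
    phrase) and the second text (right of it)."""
    for index, (text, _) in enumerate(corpus):
        if text == core_phrase:
            return corpus[:index], corpus[index + 1:]
    raise ValueError(f"core phrase {core_phrase!r} not found in corpus")
```

Either side may be empty, for example when the core phrase is the first phrase of the corpus.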
Optionally, the above-mentioned dividing module is further configured to determine, according to the part-of-speech priority, a current target part of speech corresponding to the core corpus set in the first text; determine the current target part-of-speech phrase, of the current target part of speech, contained in the first text of each core corpus in the core corpus set; sort the core corpora according to the target information of the current target part-of-speech phrase in each core corpus to obtain a second corpus sequence; determine, according to the part-of-speech priority, the current target part of speech corresponding to the core corpus set in the second text; determine the current target part-of-speech phrase, of the current target part of speech, contained in the second text of each core corpus in the core corpus set; and adjust the second corpus sequence according to the target information of the current target part-of-speech phrase in each core corpus to obtain a third corpus sequence.
Optionally, the dividing module is further configured to divide the core corpus containing the target part-of-speech phrase into a target corpus set under the condition that the target part-of-speech phrase is determined in the third corpus sequence, where the number of core corpora including the target part-of-speech phrase in the core corpus set is greater than a target threshold.
In the embodiment of the application, a plurality of corpora are each segmented according to a target word stock that includes preset segmentation phrases in a target field, and each phrase obtained by the segmentation is tagged with its part of speech. A core phrase of each corpus is determined according to the part of speech of each phrase contained in that corpus. The plurality of corpora are sorted and grouped based on the core phrases to obtain a plurality of core corpus sets, where each core corpus set includes a plurality of core corpora containing a target core phrase. In each core corpus set, the plurality of core corpora are sorted and grouped in sequence according to the part-of-speech priority, based on the target part-of-speech phrases in the core corpora, to obtain a plurality of target corpus sets, where each target corpus set includes at least one target corpus containing the target core phrase and the target part-of-speech phrase. In this way, core phrases are determined through part-of-speech tagging, the corpora are sorted and grouped by core phrase, and the core corpora are further sorted and grouped by part-of-speech priority, so that corpora are aggregated based on part of speech before being analyzed. This improves corpus processing efficiency and solves the technical problem of low corpus processing efficiency caused by analyzing corpora one at a time.
According to still another aspect of the embodiment of the present invention, there is further provided an electronic device for implementing the corpus aggregation method, where the electronic device may be a terminal device or a server shown in fig. 1. The present embodiment is described taking the electronic device as a server as an example. As shown in fig. 8, the electronic device comprises a memory 802 and a processor 804, the memory 802 having stored therein a computer program, the processor 804 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, word segmentation is respectively carried out on a plurality of corpus according to a target word stock, and part-of-speech tagging is carried out on each phrase obtained by the word segmentation, wherein the target word stock comprises preset word segmentation phrases in the target field;
S2, determining a core phrase of each corpus according to the part of speech of each phrase contained in each corpus;
S3, sorting and grouping the plurality of corpuses based on the core phrases to obtain a plurality of core corpus sets, wherein the core corpus sets comprise a plurality of core corpuses containing target core phrases;
S4, in each core corpus set, sorting and grouping the plurality of core corpuses based on the target part-of-speech phrases in the core corpuses in sequence according to the part-of-speech priority, so as to obtain a plurality of target corpus sets for corpus analysis, wherein the target corpus sets comprise at least one target corpus containing target core phrases and target part-of-speech phrases.
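Step S1 does not name a segmentation algorithm. One simple possibility, assumed here purely for illustration, is forward maximum matching against the target word stock, with each matched phrase taking the part of speech recorded for it. The function name `segment_and_tag`, the dictionary format, the tag names, and the three sample entries are all hypothetical.

```python
from typing import Dict, List, Tuple


def segment_and_tag(text: str,
                    word_stock: Dict[str, str],
                    unknown_tag: str = "unknown") -> List[Tuple[str, str]]:
    """Segment `text` by forward maximum matching and tag each phrase.

    `word_stock` maps each preset phrase to its part of speech.
    At each position the longest matching phrase wins; a character
    that matches nothing is emitted alone with `unknown_tag`.
    """
    max_len = max((len(word) for word in word_stock), default=1)
    result: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in word_stock:
                result.append((piece, word_stock[piece]))
                i += size
                break
        else:
            # No phrase in the word stock starts at this position.
            result.append((text[i], unknown_tag))
            i += 1
    return result


# Hypothetical target word stock for a home-appliance domain.
WORD_STOCK = {
    "调节": "verb",        # "adjust"
    "湿度": "dimension",   # "humidity"
    "空调": "device",      # "air conditioner"
}
```

Calling `segment_and_tag("空调调节湿度", WORD_STOCK)` yields three tagged phrases that can then be passed to steps S2 to S4. A production system would more likely rely on an established segmenter loaded with a domain dictionary.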
Alternatively, it will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely illustrative, and the electronic device may be any terminal device. Fig. 8 is not limited to the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 8, or have a different configuration than shown in FIG. 8.
The memory 802 may be used to store software programs and modules, such as the program instructions/modules corresponding to the corpus aggregation method and device in the embodiment of the present invention. The processor 804 runs the software programs and modules stored in the memory 802 to execute various functional applications and data processing, that is, to implement the corpus aggregation method described above. The memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 802 may further include memory located remotely from the processor 804, and such remote memory may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may specifically be used to store, without limitation, information such as the target word stock, the preset phrases, the core corpora, the part-of-speech priority, and the target corpora. As an example, as shown in fig. 8, the memory 802 may include, but is not limited to, the part-of-speech unit 702, the determining unit 704, the core unit 706, and the aggregation unit 708 of the corpus aggregation device. It may also include, but is not limited to, other module units of the corpus aggregation device, which are not described in detail in this example.
Optionally, the transmission device 806 is used to receive or transmit data via a network. Specific examples of the network may include wired networks and wireless networks. In one example, the transmission device 806 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and a router via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 806 is a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In addition, the electronic device further includes: a display 808 for displaying the corpus, the core corpus set, and the target corpus set; and a connection bus 810 for connecting the respective module parts in the above-described electronic device.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. Among them, the nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the corpus aggregation method provided in the various optional implementations described above. The computer program is arranged to perform, when run, the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, word segmentation is respectively carried out on a plurality of corpus according to a target word stock, and part-of-speech tagging is carried out on each phrase obtained by the word segmentation, wherein the target word stock comprises preset word segmentation phrases in the target field;
S2, determining a core phrase of each corpus according to the part of speech of each phrase contained in each corpus;
S3, sorting and grouping the plurality of corpuses based on the core phrases to obtain a plurality of core corpus sets, wherein the core corpus sets comprise a plurality of core corpuses containing target core phrases;
S4, in each core corpus set, sorting and grouping the plurality of core corpuses based on the target part-of-speech phrases in the core corpuses in sequence according to the part-of-speech priority, so as to obtain a plurality of target corpus sets for corpus analysis, wherein the target corpus sets comprise at least one target corpus containing target core phrases and target part-of-speech phrases.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be completed by a program instructing the hardware of a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The device embodiments described above are merely exemplary. For example, the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be made through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (10)

1. A method for aggregating corpora, comprising:
respectively word segmentation is carried out on a plurality of corpus according to a target word stock, and part-of-speech tagging is carried out on each phrase obtained by word segmentation, wherein the target word stock comprises preset word segmentation phrases in the target field;
according to the part of speech of each phrase contained in each corpus, determining the core phrase of each corpus;
sorting and grouping the plurality of corpuses based on the core phrases to obtain a plurality of core corpus sets, wherein the core corpus sets comprise a plurality of core corpuses containing target core phrases;
in each core corpus set, sorting and grouping the plurality of core corpuses in sequence according to part-of-speech priority, based on target part-of-speech phrases in the core corpuses, to obtain a plurality of target corpus sets for corpus analysis, wherein the target corpus sets comprise at least one target corpus containing the target core phrases and the target part-of-speech phrases.
2. The method of claim 1, wherein determining the core phrase of each corpus based on the part of speech of each phrase contained in the corpus comprises:
obtaining a core word stock, wherein the core word stock comprises a plurality of candidate core word groups with core parts of speech;
under the condition that the corpus comprises the candidate core phrases, determining the candidate core phrases in the corpus as the core phrases;
and under the condition that the corpus does not comprise the candidate core phrases, determining the phrase with the core part of speech in the corpus as the core phrase.
3. The method of claim 2, wherein determining the core phrase of each corpus based on the part of speech of each phrase contained in the corpus comprises:
under the condition that the number of phrases with the core part of speech contained in the corpus is larger than one, determining the phrase with the core part of speech positioned at the target position as the core phrase;
and under the condition that no phrase with the core part of speech is included in the corpus, determining the phrase with a candidate core part of speech at the target position in the corpus as the core phrase.
4. The method of claim 1, wherein sorting and grouping the plurality of corpuses based on the core phrases to obtain a plurality of core corpus sets comprises:
sequencing the plurality of corpuses according to the target information of the core phrase in each corpus to obtain a first corpus sequence;
determining a plurality of target core phrases in the first corpus sequence, wherein the number of the corpuses comprising the target core phrases in the plurality of corpuses is larger than a core threshold value;
grouping the plurality of corpuses in the first corpus sequence according to the target core phrases to obtain the plurality of core corpus sets.
5. The method of claim 1, wherein in each of the core corpus sets, sorting and grouping the plurality of core corpora based on the target part-of-speech phrases in the core corpus in turn according to part-of-speech priorities, to obtain a plurality of target corpus sets includes:
determining a first text and a second text of the core corpus based on the target core phrase, wherein the first text and the second text are respectively positioned at two sides of the target core phrase in the core corpus;
and sorting the core corpus based on the target part-of-speech phrase in the first text and the target part-of-speech phrase in the second text according to the part-of-speech priority, and grouping the core corpus based on the target part-of-speech phrase to obtain the plurality of target corpus sets.
6. The method of claim 5, wherein the determining the first text and the second text of the core corpus based on the target core phrase comprises:
determining a text positioned at the left side of the target core phrase in the core corpus as the first text;
and determining the text positioned on the right side of the target core phrase in the core corpus as the second text.
7. The method of claim 5, wherein the ranking the core corpus based on the target part-of-speech phrase in the first text and the target part-of-speech phrase in the second text according to the part-of-speech priority comprises:
determining a current target part of speech corresponding to the core corpus set in the first text according to the part of speech priority;
determining current target part-of-speech phrases of the current target part-of-speech contained in the first text of each core corpus in the core corpus set;
sorting the core corpus according to the target information of the current target part-of-speech phrase in the core corpus to obtain a second corpus sequence;
determining the current target part of speech corresponding to the core corpus set in the second text according to the part of speech priority;
determining the current target part-of-speech phrase of the current target part-of-speech contained in the second text of each core corpus in the core corpus set;
and adjusting the second corpus sequence according to the target information of the current target part-of-speech phrase in the core corpus to obtain a third corpus sequence.
8. The method of claim 7, wherein grouping the core corpus based on the target part-of-speech phrases to obtain the plurality of target corpus sets comprises:
under the condition that the target part-of-speech phrase is determined in the third corpus sequence, dividing the core corpus containing the target part-of-speech phrase into the target corpus set, wherein the number of the core corpora containing the target part-of-speech phrase in the core corpus set is larger than a target threshold value.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any one of claims 1 to 8.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 8 by means of the computer program.
CN202210589136.5A 2022-05-27 2022-05-27 Corpus aggregation method, storage medium and electronic equipment Pending CN117172244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210589136.5A CN117172244A (en) 2022-05-27 2022-05-27 Corpus aggregation method, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210589136.5A CN117172244A (en) 2022-05-27 2022-05-27 Corpus aggregation method, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117172244A true CN117172244A (en) 2023-12-05

Family

ID=88934129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210589136.5A Pending CN117172244A (en) 2022-05-27 2022-05-27 Corpus aggregation method, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117172244A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination