CN112395421B - Course label generation method and device, computer equipment and medium - Google Patents


Publication number
CN112395421B
Authority
CN
China
Prior art keywords
target
word
word segmentation
course
preset
Prior art date
Legal status
Active
Application number
CN202110078984.5A
Other languages
Chinese (zh)
Other versions
CN112395421A (en)
Inventor
熊龙飞
张茜
张敏
黄敏婕
胡立波
余晋琳
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110078984.5A priority Critical patent/CN112395421B/en
Publication of CN112395421A publication Critical patent/CN112395421A/en
Application granted granted Critical
Publication of CN112395421B publication Critical patent/CN112395421B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data processing and discloses a method, an apparatus, computer equipment, and a medium for generating course labels. The method comprises the following steps: collecting interactive comment data of a target course to obtain initial sentences; performing text preprocessing on the initial sentences to obtain processed sentences; segmenting the processed sentences in a preset word segmentation mode to obtain target word segments; for each target word segment, calculating its term frequency and inverse document frequency based on the TF-IDF algorithm and determining an evaluation value of the target word segment from the two; sorting the target word segments by evaluation value and selecting a preset number of top-ranked segments as secondary course labels; and classifying the secondary course labels under preset primary course labels by clustering to obtain the target course label system of the target course. The method helps improve the precision of course label system generation.

Description

Course label generation method and device, computer equipment and medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for generating a course label, a computer device, and a medium.
Background
With the rapid development of information technology, more and more enterprises provide learning courses for their users or employees. Well-known resource sites host many kinds of courses and large user groups, so pushing the courses each group cares about, from the perspective of supply and demand, requires labels for this massive number of courses. Accurate course labels also help course providers improve and refine their courses. Apart from the very coarse-grained labels added manually when a course goes online, a method that acquires course labels more efficiently, accurately, and automatically is urgently needed.
Existing schemes obtain text from course-related content and use a statistical machine learning model to learn word segmentation rules (training) so as to segment unseen text and extract key information as course labels; examples include the maximum probability word segmentation method and the maximum entropy word segmentation method. In practice, a statistics-based word segmentation system still needs a segmentation dictionary for string-matching segmentation. This approach has limitations: newly coined words cannot be identified accurately, so the generated labels are not accurate enough.
Disclosure of Invention
The embodiments of the invention provide a course label generation method, an apparatus, computer equipment, and a storage medium, which are used to improve the accuracy of course label generation.
In order to solve the foregoing technical problem, an embodiment of the present application provides a method for generating a course label, including:
collecting interactive comment data of a target course to obtain an initial sentence;
performing text preprocessing on the initial sentence to obtain a processed sentence;
performing word segmentation processing on the processed sentence in a preset word segmentation mode to obtain target word segments;
for each target word segment, calculating its term frequency and inverse document frequency based on the TF-IDF algorithm, and determining an evaluation value of the target word segment from the obtained term frequency and inverse document frequency;
sorting the target word segments by their evaluation values, and selecting a preset number of top-ranked target word segments as secondary course labels;
and classifying the secondary course labels under preset primary course labels by clustering to obtain a target course label system of the target course.
Optionally, the acquiring interactive comment data of the target course to obtain the initial sentence includes:
determining the floor weight of each comment interaction floor in a link analysis mode;
determining a target floor according to each floor weight and a preset weight threshold;
calculating the ranking value of each target floor based on a preset ranking strategy, and sorting the target floors in descending order of ranking value to obtain a target floor queue;
and capturing the content in the target floor based on the target floor queue to obtain the initial sentence.
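The floor-selection steps above can be sketched as follows. This is an illustrative assumption of one way to realize them, not the patent's actual implementation: a simplified PageRank-style link analysis over reply links yields each comment floor's weight, floors below a preset weight threshold are discarded, and the remaining target floors are queued in descending order of a simple ranking value (here, weight scaled by likes — the ranking strategy itself is not specified in this passage).

```python
def floor_weights(reply_links, n_floors, damping=0.85, iters=20):
    """PageRank-style floor weights. reply_links: (src, dst) pairs meaning
    floor src replies to (links to) floor dst."""
    out_degree = [0] * n_floors
    for src, _ in reply_links:
        out_degree[src] += 1
    w = [1.0 / n_floors] * n_floors
    for _ in range(iters):
        nxt = [(1.0 - damping) / n_floors] * n_floors
        for src, dst in reply_links:
            nxt[dst] += damping * w[src] / out_degree[src]
        # floors with no outgoing reply spread their weight uniformly
        dangling = damping * sum(w[i] for i in range(n_floors) if out_degree[i] == 0)
        w = [v + dangling / n_floors for v in nxt]
    return w

def target_floor_queue(weights, likes, weight_threshold=0.05):
    """Keep floors at or above the weight threshold and sort them in
    descending order of an assumed ranking value: weight * (1 + likes)."""
    targets = [i for i, w in enumerate(weights) if w >= weight_threshold]
    return sorted(targets, key=lambda i: weights[i] * (1 + likes[i]), reverse=True)
```

Floors that many other floors reply to accumulate weight, so heavily discussed comments are crawled first.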
Optionally, the text preprocessing is performed on the initial sentence, and obtaining a processed sentence includes:
performing case unification and traditional-to-simplified conversion on the initial sentence to obtain a standard text;
and extracting and tagging the useless words of the standard text to obtain a tagged processed sentence.
Optionally, before segmentation, a preset training corpus is obtained and analyzed with an N-gram model to obtain word sequence data of the preset training corpus;
and the performing of word segmentation processing on the processed sentence in the preset word segmentation mode to obtain target word segments comprises:
performing segmentation analysis on the processed sentence to obtain M candidate word segmentation sequences;
for each candidate sequence, calculating its occurrence probability from the word sequence data of the preset training corpus, to obtain the occurrence probabilities of the M sequences;
and selecting, from the M sequences, a sequence whose occurrence probability reaches a preset probability threshold as the target word segmentation sequence, and taking each word in the target sequence as a target word segment of the processed sentence.
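The N-gram selection above can be sketched with a bigram model. This is an assumed minimal implementation, not the patent's code: counts from an already-segmented training corpus give add-one-smoothed conditional probabilities, each candidate segmentation's occurrence probability is the product of its bigram probabilities, and the highest-scoring candidate is kept.

```python
from collections import Counter

def bigram_counts(corpus_sentences):
    """corpus_sentences: list of already-segmented sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in corpus_sentences:
        padded = ["<s>"] + words          # sentence-start marker
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sequence_probability(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram probability of one candidate segmentation."""
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
        prev = w
    return prob

def best_segmentation(candidates, unigrams, bigrams):
    """Of M candidate segmentations, keep the most probable one."""
    vocab = len(unigrams)
    return max(candidates,
               key=lambda c: sequence_probability(c, unigrams, bigrams, vocab))
```

A segmentation that splits the sentence along word boundaries seen in the corpus gets a higher product of bigram probabilities than one that fuses words the corpus never saw.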
Optionally, after the processed sentence is segmented in the preset word segmentation mode to obtain target word segments, the course label generation method further comprises:
constructing a base word vector for each target word segment based on a preset corpus;
for each base word vector, calculating its spatial distance to every other base word vector, and treating the two base word vectors corresponding to each spatial distance as a word vector pair;
if a spatial distance is smaller than a preset distance threshold, determining the corresponding word vector pair to be near-synonym vectors, and taking the two target word segments corresponding to the near-synonym vectors as a near-synonym pair;
and merging each near-synonym pair to obtain the updated target word segments.
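The near-synonym merge above can be sketched as follows, under the assumption that each target word segment already has a word vector built from the preset corpus (the vectors and threshold below are toy values): pairs whose Euclidean distance falls below the threshold are treated as near-synonyms, and each is merged into a single representative segment.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def merge_near_synonyms(word_vectors, distance_threshold):
    """word_vectors: dict word -> vector. Returns word -> representative word,
    merging each word into the earliest near-synonym already kept."""
    representative, kept = {}, []
    for word, vec in word_vectors.items():
        match = next((k for k in kept
                      if euclidean(vec, word_vectors[k]) < distance_threshold),
                     None)
        representative[word] = match if match is not None else word
        if match is None:
            kept.append(word)
    return representative
```

Merging near-synonyms before TF-IDF scoring pools the counts of variants of the same concept, so the concept is not underrated because its mentions were split across spellings.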
Optionally, the classifying of the secondary course labels under the preset primary course labels by clustering to obtain the target course label system of the target course comprises:
converting the preset primary course labels into word vectors, and taking each obtained word vector as a cluster center;
calculating the Euclidean distance from the word vector of each secondary course label to each cluster center as that word vector's spatial distance;
for each secondary course label, taking the cluster center with the smallest spatial distance as the target cluster center, and taking the preset primary course label corresponding to the target cluster center as the target category;
and classifying each secondary course label into its corresponding target category to obtain the target course label system.
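The clustering steps above can be sketched as nearest-center assignment. This is an illustrative sketch with toy vectors, not real embeddings: each preset primary label's word vector is a fixed cluster center, and every secondary label goes to the primary label whose center is nearest in Euclidean distance.

```python
import math

def assign_to_primary(primary_vectors, secondary_vectors):
    """primary_vectors / secondary_vectors: dict label -> word vector.
    Returns the label system: dict primary label -> list of secondary labels."""
    system = {p: [] for p in primary_vectors}
    for sec, svec in secondary_vectors.items():
        nearest = min(primary_vectors,
                      key=lambda p: math.dist(svec, primary_vectors[p]))
        system[nearest].append(sec)
    return system
```

Because the centers are fixed at the primary labels' vectors, this is a single nearest-centroid pass rather than iterative k-means, which matches the description of classifying under preset primary labels.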
In order to solve the foregoing technical problem, an embodiment of the present application further provides an apparatus for generating a course label, including:
the data acquisition module is used for acquiring interactive comment data of the target course to obtain an initial sentence;
the preprocessing module is used for performing text preprocessing on the initial sentence to obtain a processed sentence;
the word segmentation module is used for segmenting the processed sentence in a preset word segmentation mode to obtain target word segments;
the evaluation module is used for calculating, for each target word segment, its term frequency and inverse document frequency based on the TF-IDF algorithm, and determining an evaluation value of the target word segment from the obtained term frequency and inverse document frequency;
the sorting module is used for sorting the target word segments by their evaluation values and selecting a preset number of top-ranked target word segments as secondary course labels;
and the system generation module is used for classifying the secondary course labels under the preset primary course labels by clustering to obtain a target course label system of the target course.
Optionally, the data acquisition module comprises:
the weight determining unit is used for determining the floor weight of each comment interaction floor in a link analysis mode;
the floor determining unit is used for determining a target floor according to each floor weight and a preset weight threshold;
the floor sorting unit is used for calculating the ranking value of each target floor based on a preset ranking strategy and sorting the target floors in descending order of ranking value to obtain a target floor queue;
and the content capturing unit is used for capturing the content in the target floor based on the target floor queue to obtain the initial sentence.
Optionally, the pre-processing module comprises:
the text conversion unit is used for performing case unification and traditional-to-simplified conversion on the initial sentence to obtain a standard text;
and the useless word extracting and tagging unit is used for extracting and tagging the useless words of the standard text to obtain a tagged processed sentence.
Optionally, the word segmentation module comprises:
the segmentation analysis unit is used for performing segmentation analysis on the processed sentence to obtain M candidate word segmentation sequences;
the probability calculation unit is used for calculating, for each candidate sequence, its occurrence probability from the word sequence data of the preset training corpus, to obtain the occurrence probabilities of the M sequences;
and the word segment determining unit is used for selecting, from the M sequences, a sequence whose occurrence probability reaches a preset probability threshold as the target word segmentation sequence, and taking each word in the target sequence as a target word segment of the processed sentence.
Optionally, the device for generating course labels further comprises:
the vector construction module is used for constructing a base word vector for each target word segment based on a preset corpus;
the vector grouping module is used for calculating, for each base word vector, its spatial distance to every other base word vector, and treating the two base word vectors corresponding to each spatial distance as a word vector pair;
the near-synonym determining module is used for determining, if a spatial distance is smaller than a preset distance threshold, the corresponding word vector pair to be near-synonym vectors, and taking the two target word segments corresponding to the near-synonym vectors as a near-synonym pair;
and the segment merging module is used for merging each near-synonym pair to obtain the updated target word segments.
Optionally, the system generation module includes:
the vector conversion unit is used for converting the preset primary course labels into word vectors and taking each obtained word vector as a cluster center;
the distance calculation unit is used for calculating the Euclidean distance from the word vector of each secondary course label to each cluster center as that word vector's spatial distance;
the cluster center determining unit is used for taking, for each secondary course label, the cluster center with the smallest spatial distance as the target cluster center, and taking the preset primary course label corresponding to the target cluster center as the target category;
and the target course label system generating unit is used for classifying each secondary course label into its corresponding target category to obtain the target course label system.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above course tag generation method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above course label generation method.
In the embodiments of the invention, the initial sentences are obtained by collecting interactive comment data of the target course, which avoids the unfocused content and inaccurate labels that result from crawling and analyzing the course content itself. Text preprocessing is then performed on the initial sentences to obtain processed sentences, and the processed sentences are segmented in a preset word segmentation mode to obtain target word segments. For each target word segment, its term frequency and inverse document frequency are calculated based on the TF-IDF algorithm, and an evaluation value is determined from the two; evaluating the importance of each target word segment in this way screens out the more important segments, which helps improve the accuracy of the subsequent selection of secondary course labels. The target word segments are sorted by evaluation value, a preset number of top-ranked segments are selected as secondary course labels, and the secondary course labels are classified under the preset primary course labels by clustering to obtain the target course label system of the target course, so that the course label system is generated quickly and accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating course tags of the present application;
FIG. 3 is a block diagram of one embodiment of a curriculum label generation apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
The method for generating the course label provided in the embodiment of the present application is executed by the server, and accordingly, the device for generating the course label is disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a method for generating a course label according to an embodiment of the present invention, which is described by taking the method applied to the server in fig. 1 as an example, and is detailed as follows:
s201: and collecting interactive comment data of the target course to obtain an initial sentence.
Specifically, in this embodiment, a comment interaction area is provided below each course. While taking a course, a user can discuss and evaluate it there, and the course can be labeled according to this interactive comment data, so that it can later be improved and promoted according to its labels.
The comment area data are crawled by a web crawler to obtain the initial data; for the specific implementation, refer to the description of the subsequent embodiments, which is not repeated here.
S202: and performing text preprocessing on the initial sentence to obtain a processed sentence.
Specifically, because the obtained initial sentence contains many interfering words, it must be preprocessed into a processed sentence that an algorithmic model can recognize easily, which improves the accuracy of subsequent segmentation.
In this embodiment, the text preprocessing includes, but is not limited to, case conversion, simplified and traditional conversion, stop word tagging, and the like.
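The preprocessing steps listed above can be sketched with the standard library only. This is an assumed minimal sketch: it performs case unification and stop-word tagging; traditional-to-simplified conversion would normally use an external library such as OpenCC and is left as a commented stub, and the stop-word list and whitespace tokenization are toy simplifications.

```python
STOP_WORDS = {"的", "了", "是", "the", "a", "of"}  # toy stop-word list

def preprocess(sentence):
    """Lowercase the sentence and tag stop words so a segmenter can later
    use the tagged positions as segmentation boundaries."""
    unified = sentence.lower()                 # case unification
    # unified = opencc_convert(unified)        # traditional->simplified (stub)
    tokens = unified.split()                   # toy tokenization
    return [(tok, "STOP" if tok in STOP_WORDS else "KEEP") for tok in tokens]
```

Tagging rather than deleting the useless words preserves them as demarcation points, matching the later note that tagged content is used as segmentation boundaries.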
S203: and performing word segmentation processing on the processed sentence by adopting a preset word segmentation mode to obtain a target word segmentation.
Specifically, each obtained processed sentence is segmented in the preset word segmentation mode to obtain the target word segments contained in it.
The preset word segmentation mode includes but is not limited to: through a third-party word segmentation tool or a word segmentation algorithm, and the like.
Common third-party word segmentation tools include, but are not limited to: the Stanford NLP segmenter, the ICTCLAS segmentation system, the ansj segmentation tool, the HanLP Chinese segmentation tool, and the like.
The word segmentation algorithm includes, but is not limited to: the Forward Maximum Matching (FMM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Maximum Matching (BMM) algorithm, the Hidden Markov Model (HMM), the N-gram model, and the like.
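Of the algorithms just listed, forward maximum matching is the simplest to illustrate. The sketch below is a generic textbook implementation, not the patent's: scan left to right, at each position taking the longest dictionary word that matches, falling back to a single character.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Forward Maximum Matching (FMM) segmentation of text against a word set."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)   # longest dictionary match (or 1 char)
                i += length
                break
    return words
```

Its limitation is exactly the one the Background notes: a word absent from the dictionary (a new coinage) degrades into single characters, which is why the embodiment supplements dictionary matching with statistics.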
It should be noted that, when segmenting in the preset word segmentation mode, the tagged content in the processed sentence (the tagged words; see the description of the following embodiments) is used as demarcation points for segmentation, which helps improve segmentation accuracy.
Extracting target word segments by segmentation, on the one hand, further filters out meaningless words in the processed sentences; on the other hand, it also facilitates the subsequent screening of secondary course labels from the target word segments, and removing redundant data improves the accuracy and efficiency of label screening.
In this embodiment, the course labels to be extracted are entity words, such as current affairs, tools, guests, and meetings. Therefore, in the actual segmentation process, the complete sentence is segmented, and after the segmentation result is obtained, adjectives, adverbs, and useless words (also called stop words) are removed while the entity words are retained. This reduces the number of target word segments to be evaluated later and improves the efficiency of their evaluation.
S204: for each target word segment, calculate its term frequency and inverse document frequency based on the TF-IDF algorithm, and determine an evaluation value of the target word segment from the obtained term frequency and inverse document frequency.
Specifically, in practical applications, a statistics-based word segmentation system needs a segmentation dictionary for string-matching segmentation. That approach has limitations and lacks pertinence to the Chinese comments in this embodiment; some new words must be identified statistically from the actual course evaluation data. Combining string frequency statistics with string matching exploits both the speed and efficiency of matching-based segmentation and the advantages of dictionary-free segmentation: identifying words regardless of context and disambiguating automatically.
In this embodiment, the TF-IDF algorithm is applied to the segmentation data of the course: statistics are gathered for each target word segment to obtain its term frequency and inverse document frequency, from which its evaluation value is determined.
TF-IDF is a statistical method for evaluating the importance of a word to one of a set of documents or a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus.
Term Frequency (TF) is the number of times a given term appears in a document. This count is typically normalized (the term count divided by the document's total word count) to prevent bias towards long documents, since the same term may have a higher raw count in a long document than in a short one regardless of its importance.
The main idea of Inverse Document Frequency (IDF) is: the fewer the documents containing the term t, the larger the IDF, and the better the term's category-distinguishing ability. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term and taking the logarithm of the resulting quotient.
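The TF and IDF definitions above combine directly into the evaluation value. The sketch below follows those definitions literally (TF normalized by document length, IDF as the log of total documents over containing documents, score = TF × IDF); the function and variable names are illustrative, not the patent's.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists (one list per comment/document).
    Returns, per document, a dict token -> TF-IDF evaluation value."""
    n_docs = len(documents)
    doc_freq = Counter()                       # documents containing each term
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            tok: (counts[tok] / len(doc))      # normalized term frequency
                 * math.log(n_docs / doc_freq[tok])  # inverse document frequency
            for tok in counts
        })
    return scores
```

A term appearing in every document gets IDF = log(1) = 0, so ubiquitous filler words score zero regardless of how often they occur, which is exactly the screening effect the embodiment relies on.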
S205: sort the target word segments by their evaluation values, and select a preset number of top-ranked target word segments as secondary course labels.
Specifically, the target word segments are sorted by evaluation value, and a preset number of top-ranked segments are selected as secondary course labels, where the preset number can be chosen according to actual needs.
S206: and classifying the secondary course labels under the preset primary course labels in a clustering mode to obtain a target course label system of the target course.
Specifically, each secondary course label is classified under a preset primary course label by semantic clustering to obtain the target course label system of the target course. For the specific implementation, reference may be made to the description of the subsequent embodiments, which is not repeated here.
In this embodiment, initial sentences are obtained by collecting interactive comment data of the target course, avoiding the unfocused and inaccurate labels that result from crawling and analyzing the course content itself. The initial sentences are text-preprocessed into processed sentences, and the processed sentences are segmented in a preset word segmentation mode to obtain target participles. For each target participle, word frequency and inverse document frequency are calculated based on the TF-IDF algorithm, and an evaluation value is determined from the obtained word frequency and inverse document frequency; by evaluating the importance of each target participle, the more important participles are screened out, improving the accuracy of the subsequent secondary-label selection. The target participles are then sorted by evaluation value, the top-ranked participles up to a preset threshold are selected, from front to back, as secondary course labels, and the secondary course labels are classified under preset primary course labels by clustering to obtain the target course label system of the target course, thereby generating the label system quickly and accurately.
In some optional implementation manners of this embodiment, in step S201, acquiring interactive comment data of the target course, and obtaining the initial sentence includes:
determining the floor weight of each comment interaction floor in a link analysis mode;
determining a target floor according to the weight of each floor and a preset weight threshold;
calculating the ranking value of each target floor based on a preset ranking strategy, and sorting the target floors in descending order of ranking value to obtain a target floor queue;
and capturing the content in the target floor based on the target floor queue to obtain an initial sentence.
Specifically, before crawling the page, link analysis is performed on each comment interaction floor to be crawled to confirm the weight of each floor, so that the target floors to be crawled can subsequently be determined from the weights. A reference weight is preset on the server side: when a calculated floor weight is greater than the preset reference weight, the floor is confirmed to be worth crawling and is determined to be a target floor. The ranking value of each target floor is then calculated with the preset ranking strategy, the target floors are queued in descending order of ranking value to form the target floor queue, and the content of the target floors is crawled in the order of the pages in the queue to obtain the initial sentences.
The link analysis refers to the analysis of the basic characteristics of the floor, and mainly refers to the analysis of the user behavior, the floor content and the comment effectiveness of the floor.
Wherein, the preset ranking policy includes but is not limited to: PageRank strategy, Hilltop algorithm, link relation based ranking (TrustRank) algorithm, ExpertRank and the like.
Preferably, the present embodiment employs the PageRank policy for calculating the ranking value of each target floor.
In the embodiment, the initial sentences with high quality are obtained by analyzing and crawling the contents of the comment interaction floors, so that the accuracy of the subsequent generation of the tag system is improved.
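Since this embodiment prefers the PageRank strategy for ranking target floors, the following sketch shows power-iteration PageRank over a hypothetical reply graph of comment floors; the floor names, link structure, damping factor, and iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterative PageRank over a directed graph {node: [nodes it links to]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in links.items():
            if not outs:  # dangling node: spread its rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                for v in outs:
                    new[v] += damping * rank[u] / len(outs)
        rank = new
    return rank

# hypothetical comment thread: each floor "links to" the floors it replies to
floors = {"f1": ["f2"], "f2": ["f1", "f3"], "f3": ["f1"], "f4": ["f1"]}
r = pagerank(floors)
queue = sorted(r, key=r.get, reverse=True)  # target floor queue, high to low
```

Floor f1, referenced by three other floors, ends up at the head of the crawl queue.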
In some optional implementation manners of this embodiment, in step S202, performing text preprocessing on the initial sentence, and obtaining the processed sentence includes:
carrying out case unification and traditional body conversion on the initial sentence to obtain a standard text;
and carrying out useless word extraction and labeling on the standard text to obtain a labeled processing sentence.
For case unification, uppercase character strings in the text are converted into lowercase character strings; the initial sentences are matched and converted through regular expressions, and traditional Chinese characters are converted into simplified Chinese characters through a traditional-simplified mapping list. Applying case unification and traditional-to-simplified conversion to the initial sentences yields the standard text and effectively improves its recognizability.
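A minimal sketch of these two normalization steps; the tiny traditional-to-simplified mapping table here is an illustrative fragment — a real system would load a complete mapping list.

```python
import re

# illustrative fragment of a traditional-to-simplified mapping list;
# a production system would load a full mapping table
T2S = {"學": "学", "習": "习", "課": "课", "後": "后"}

def normalize(text):
    # case unification: fold ASCII uppercase strings to lowercase via regex matching
    text = re.sub(r"[A-Z]+", lambda m: m.group(0).lower(), text)
    # traditional-to-simplified conversion, character by character
    return "".join(T2S.get(ch, ch) for ch in text)

print(normalize("Python課程學習"))  # -> "python课程学习"
```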
The useless words are auxiliary words, stop words, and the like, such as "of", "you", "also", "le", and so on.
It should be noted that some initial sentences may contain special characters, that is, character combinations that cannot be recognized. To prevent the special characters from affecting the overall recognizability of the text, this embodiment further uses a preset special character recognition module to clean the special characters, improving the accuracy of subsequent word segmentation.
In this embodiment, the special characters are mainly identified by regular expressions and loop recording.
In a specific embodiment, the useless word extraction and labeling of the standard text are verified through a useless word model. The useless word model is preset with conventional useless words and, after subsequent word segmentation, automatically adds single-character (length-1) emotional words through automatic learning, thereby enriching the model — for example: big | extremely | pain | too | good | bad | death | strong | aversion | disorder | true | happiness | anger | sadness | etc.
In this embodiment, stop words with a length other than 1 can be identified through autonomous learning of a stop-word model. Specifically, the stop-word model can be implemented by supervised learning training on the existing stop words (those first preset and those subsequently identified) using a long short-term memory (LSTM) neural network.
In the embodiment, the standard text is obtained by carrying out case unification and traditional transformation on the initial sentence, and then the useless word extraction and labeling are carried out on the standard text to obtain the labeled processing sentence, so that the sentence quality is improved, the interference of the irregular format and the useless word is avoided, and the accuracy of the subsequent word segmentation is improved.
In some optional implementations of this embodiment, a preset training corpus is obtained and analyzed using an N-gram model to obtain word sequence data of the preset training corpus, and in step S203, performing word segmentation on the processed sentence in a preset word segmentation mode to obtain target participles includes:
performing word segmentation analysis on the processed sentences to obtain M word segmentation sequences;
aiming at each word segmentation sequence, calculating the occurrence probability of each word segmentation sequence according to word sequence data of a preset training corpus to obtain the occurrence probability of M word segmentation sequences;
and selecting the word segmentation sequence corresponding to the occurrence probability reaching a preset probability threshold from the occurrence probabilities of the M word segmentation sequences as a target word segmentation sequence, and taking each word segmentation in the target word segmentation sequence as a target word segmentation contained in the processing sentence.
Specifically, the training corpus is a corpus obtained by training relevant language materials for evaluating basic sentences in natural language. An N-gram model performs statistical analysis on each item in the preset training corpus to count the number of times one item H appears after another item I, thereby obtaining word sequence data for the word sequence consisting of item I + item H. The content of the training corpus in the embodiment of the present invention includes, but is not limited to: professional information corresponding to the course comments or course topics, network corpora, general corpora, and the like.
A corpus refers to a large-scale electronic text library that has been scientifically sampled and processed. Corpora are a basic resource of linguistic research and the main resource of empirical language research methods; they are applied in lexicography, language teaching, traditional language research, and statistics- or example-based research in natural language processing. A corpus item, i.e., a piece of language material, is both the object of linguistic research and the basic unit of which a corpus is composed.
For example, in one embodiment, in the field of current news, the preset training corpus is obtained by crawling comments on popular web topics and current news with a web crawler.
A word sequence refers to a sequence formed by combining at least two corpus items in a certain order; word sequence frequency refers to the ratio of the number of occurrences of the word sequence to the total number of occurrences of all participles (Word Segmentation results) in the entire corpus; a participle is a word obtained by combining a continuous character sequence in a preset combination mode. For example, if the word sequence "love tomatoes" occurs 100 times in the entire corpus, and the occurrences of all participles in the entire corpus sum to 100,000, the frequency of the word sequence "love tomatoes" is 0.001.
The N-gram model is a language model commonly used in large-vocabulary continuous character semantic recognition, and the sentence with the maximum probability can be calculated by utilizing collocation information between adjacent words in the context when the continuous blank-free characters need to be converted into Chinese character strings (namely sentences), so that the automatic conversion of Chinese characters is realized, manual selection of a user is not needed, and the accuracy of determining the word segmentation sequence is improved.
It should be noted that, in order to improve the word segmentation efficiency of the comment content, in this embodiment, a process of obtaining the word sequence data of the preset training corpus by obtaining the preset training corpus and analyzing the preset training corpus using the N-gram model may be performed before word segmentation, and the obtained word sequence data is stored, and when word segmentation is required, the word sequence data is directly called.
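The precomputation described above — counting how often one item follows another in the training corpus, then storing the counts for direct use at segmentation time — can be sketched as follows; the toy corpus is an illustrative assumption.

```python
from collections import Counter

def build_bigram_counts(corpus):
    """Count how often each item H follows item I in a tokenized corpus.

    corpus: list of token lists. Returns (bigram_counts, unigram_counts),
    which can be persisted once and reloaded whenever segmentation is needed.
    """
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # adjacent (I, H) pairs
    return bigrams, unigrams

# hypothetical tokenized training corpus
corpus = [["I", "love", "tomatoes"], ["I", "love", "courses"], ["love", "tomatoes"]]
bi, uni = build_bigram_counts(corpus)
```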
Further, performing word segmentation analysis on the basic sentence to obtain M word segmentation sequences, which specifically includes:
Different sentence-breaking modes may yield different understandings of the same sentence. To ensure the correctness of sentence understanding, after obtaining the processed sentence, the server side derives the M word segmentation sequences that the sentence can form, where M is the total number of all possible word segmentation sequences.
Each word segmentation sequence is a result obtained by dividing a basic sentence, and the obtained word sequence comprises at least two word segmentations.
For example, in one embodiment, a base sentence is "today true hot". Parsing the base sentence yields word segmentation sequence A: "today", "true", "hot", and word segmentation sequence B: "today", "Tianzhen", "hot" (an implausible split produced by a different sentence breaking), and so on.
The generation probability of a word segmentation sequence can be calculated using the Markov hypothesis: the occurrence of the Y-th word is related only to the preceding Y-1 words and to no other word, and the probability of the whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting in the corpus the number of times the Y words occur together. That is:

P(w1, w2, ..., wY) = P(w1) × P(w2 | w1) × ... × P(wY | w1, w2, ..., w(Y-1))

where P(w1, w2, ..., wY) is the probability of the whole sentence appearing, and P(wY | w1, w2, ..., w(Y-1)) is the probability that the Y-th participle appears after the word sequence consisting of the first Y-1 participles.
For example, for the sentence "The Chinese nation is a nation with a long history of civilization", the segmented word sequence is: "Chinese nation", "is", "one", "having", "long", "civilization", "history", "of", "nation" — nine participles in total. With Y = 9, what is calculated is the probability that the participle "nation" appears after the word sequence formed by the preceding eight participles.
In this embodiment, for each word segmentation sequence, an occurrence probability is obtained through calculation, and the occurrence probabilities of M word segmentation sequences are obtained in total, the occurrence probabilities of the M word segmentation sequences are respectively compared with a preset probability threshold, the occurrence probability greater than or equal to the preset probability threshold is selected as an effective occurrence probability, and then word segmentation sequences corresponding to the effective occurrence probability are found, and the word segmentation sequences are used as target word segmentation sequences.
By comparing with a preset probability threshold value, the word segmentation sequences with the occurrence probability not meeting the requirement are filtered, so that the selected target word segmentation sequences are closer to the meaning expressed in the natural language, and the word segmentation accuracy is improved.
In the embodiment, the processed sentences are segmented through the pre-trained N-gram model, and the accuracy and efficiency of segmentation are improved.
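As a sketch of scoring candidate segmentation sequences with the chain rule above, the following uses a bigram (N = 2) special case with add-alpha smoothing; the counts, tokens (including the nonsense token "skyreal" standing in for an implausible split), and smoothing constants are illustrative assumptions, not values from the patent.

```python
from collections import Counter

# bigram and unigram counts assumed to come from a pre-analyzed training corpus
bi = Counter({("today", "really"): 8, ("really", "hot"): 6})
uni = Counter({"today": 20, "really": 10, "hot": 7, "skyreal": 1})

def sequence_probability(words, alpha=0.01, vocab=1000):
    """Bigram chain-rule probability with add-alpha smoothing:
    P(w1..wn) ~= P(w1) * product of P(wi | w(i-1))."""
    total = sum(uni.values())
    p = (uni[words[0]] + alpha) / (total + alpha * vocab)
    for prev, cur in zip(words, words[1:]):
        p *= (bi[(prev, cur)] + alpha) / (uni[prev] + alpha * vocab)
    return p

# two candidate segmentations of the same sentence (cf. "today true hot")
a = sequence_probability(["today", "really", "hot"])
b = sequence_probability(["today", "skyreal", "hot"])
```

The plausible segmentation scores far higher than the implausible one, so comparing these probabilities against a preset threshold selects the target segmentation sequence.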
In some optional implementation manners of this embodiment, after step S203, the method for generating a course label further includes:
constructing a basic word vector of each target word segmentation based on a preset corpus;
calculating the space distance between the basic word vector and each other basic word vector aiming at each basic word vector, and taking two basic word vectors corresponding to each space distance as a group of word vectors;
if the spatial distance is smaller than a preset distance threshold, determining a group of word vectors corresponding to the spatial distance as near-meaning word vectors, and acquiring two target participles corresponding to the near-meaning word vectors as a group of near-meaning words;
and merging each group of similar meaning words to obtain updated target participles.
Specifically, target participles in the comment interaction data are mapped into a vector according to a preset corpus, the vectors are connected together to form a word vector space, and each vector corresponds to a point in the space.
For example, the product names of a certain automobile sales company contain two keywords, "BMW" and "Benz". All possible classifications of the two keywords are obtained from a preset corpus: "car", "luxury", "animal", "action", and "food". A vector representation is therefore introduced for these two keywords:
<car, luxury, animal, action, food>
The probability of each of the two target participles belonging to each classification is calculated with a statistical learning method; the probabilities learned by the computer are as follows:
BMW = <0.5, 0.2, 0.2, 0.0, 0.1>
Benz = <0.7, 0.2, 0.0, 0.1, 0.0>
It is understood that the value of each dimension of the base word vector represents a feature that has certain semantic and grammatical interpretations, and thus each dimension of the base word vector may be referred to as a target participle feature.
Further, a key word vector is constructed for each target word segmentation extracted from the comment sentences, and a basic word vector is obtained.
It should be noted that each target participle corresponds to a unique base word vector, and each base word vector corresponds to at least one target participle.
By constructing the basic word vector of each target word segmentation based on the preset corpus, characters which cannot be accurately understood by a machine are converted into word vectors which are easy to identify and operate by the machine, so that subsequent calculation through digitalization is facilitated, synonyms and near-synonyms are determined, data redundancy is reduced, and the efficiency of subsequent label classification is facilitated to be improved.
Further, for each base word vector, the spatial distance between it and every other base word vector is calculated with the spatial distance formula. The closer the distance, the closer the semantics of the two base word vectors, so synonyms and near-synonyms can subsequently be distinguished by distance.
The spatial distance L between a base word vector A = (a1, a2, ..., an) and a base word vector B = (b1, b2, ..., bn) is calculated according to formula (1):

L = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)    (1)

wherein n is a positive integer greater than or equal to 2.

For example, in one embodiment, the base word vectors include V1, V2, and V3. For V1, formula (1) gives a spatial distance of 0.5659 from V1 to V2 and a spatial distance of 0.1414 from V1 to V3.
It is easy to understand that the closer the spatial distance between two base word vectors, the closer their semantics; when the semantics are close enough, the target participles corresponding to the two vectors can be determined to be synonyms or near-synonyms.
The preset distance threshold may be set according to actual requirements, and is not specifically limited herein.
In this embodiment, for each group of near-synonyms, one target participle is retained, and the other target participle is removed, so that merging of the near-synonyms is realized, and an updated target participle is obtained. Preferably, word frequency statistics is performed on each target participle in the similar meaning words, and the target participle with higher word frequency is reserved.
In the embodiment, after the target participle is obtained, the target participle is subjected to near-synonym merging and screening to obtain the updated target participle, so that interference of too many near-synonym synonyms on subsequent target participle evaluation is avoided, the operation amount is reduced, and the efficiency and the accuracy of the subsequent target participle evaluation are improved.
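A minimal sketch of near-synonym detection and merging with the Euclidean distance of formula (1); the word vectors, word frequencies, and distance threshold are all hypothetical.

```python
import math

def euclidean(a, b):
    """Formula (1): L = sqrt(sum of (a_i - b_i)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical base word vectors over <car, luxury, animal, action, food>
vectors = {"BMW":   [0.5, 0.2, 0.2, 0.0, 0.1],
           "Benz":  [0.7, 0.2, 0.0, 0.1, 0.0],
           "apple": [0.0, 0.0, 0.1, 0.0, 0.9]}
word_freq = {"BMW": 12, "Benz": 30, "apple": 5}  # hypothetical frequencies

THRESHOLD = 0.4  # preset distance threshold, tuned to actual requirements
kept = set(vectors)
words = list(vectors)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if w1 in kept and w2 in kept and euclidean(vectors[w1], vectors[w2]) < THRESHOLD:
            # near-synonyms: keep the participle with the higher word frequency
            kept.discard(w1 if word_freq[w1] < word_freq[w2] else w2)
```

Here "BMW" and "Benz" fall under the threshold and merge (keeping the higher-frequency "Benz"), while "apple" survives as a distinct participle.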
In some optional implementation manners of this embodiment, in step S206, classifying the secondary class labels into preset primary class labels in a clustering manner, and obtaining a target class label system of the target class includes:
performing word vector transformation on a preset first-level course label, and taking each obtained word vector as a clustering center;
respectively calculating the Euclidean distance from the word vector corresponding to the secondary course label to each clustering center as the space distance of the word vector corresponding to the secondary course label;
for each secondary class label, acquiring a clustering center corresponding to the spatial distance with the minimum numerical value as a target clustering center, and taking a preset primary class label corresponding to the target clustering center as a target category;
and classifying the secondary course labels into target classes corresponding to the secondary course labels to obtain a target course label system.
Specifically, the first-level class labels are preset in this embodiment, word vectors corresponding to the first-level class labels are obtained by converting word vectors of the preset first-level class labels, and then the word vectors corresponding to the first-level class labels are used as a clustering center to cluster each second-level class label, and the second-level class labels are classified into the corresponding first-level class labels according to the clustering center to which the word vectors corresponding to the second-level class labels after clustering belong.
It should be noted that stopping at secondary course labels is merely a preferred mode of this embodiment; in actual production applications, third-level, fourth-level, or even more label levels may be generated, and the method provided in this embodiment can still be used for classification under the primary course labels, which should not be construed as limiting.
In the embodiment, the labels are classified and classified in a word vector mode to form a label system, so that the accuracy of the label system is improved.
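A minimal sketch of assigning each secondary label to its nearest primary-label cluster center; the label names and two-dimensional word vectors are hypothetical stand-ins for the real word vector conversions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical word vectors; primary-label vectors serve as cluster centers
primary = {"content":  [1.0, 0.0],
           "teaching": [0.0, 1.0]}
secondary = {"too fast":   [0.9, 0.2],
             "clear talk": [0.1, 0.8],
             "good notes": [0.3, 0.7]}

taxonomy = {label: [] for label in primary}
for tag, vec in secondary.items():
    # target category: primary label whose center has the smallest distance
    target = min(primary, key=lambda p: euclidean(vec, primary[p]))
    taxonomy[target].append(tag)
```

Each secondary label lands under the primary label whose cluster center is closest, yielding the target course label system.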
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 is a schematic block diagram of the generation apparatus of the course label in one-to-one correspondence with the generation method of the course label in the above-described embodiment. As shown in fig. 3, the curriculum label generating device includes a data collecting module 31, a preprocessing module 32, a word segmentation module 33, an evaluation module 34, a sorting module 35 and a system generating module 36. The functional modules are explained in detail as follows:
the data acquisition module 31 is used for acquiring interactive comment data of the target course to obtain an initial sentence;
a preprocessing module 32, configured to perform text preprocessing on the initial sentence to obtain a processed sentence;
the word segmentation module 33 is configured to perform word segmentation on the processed sentence in a preset word segmentation mode to obtain a target word segmentation;
the evaluation module 34 is configured to perform word frequency calculation and inverse text frequency index calculation on the target participles based on a TF-IDF algorithm, and determine an evaluation value of the target participles according to the obtained word frequency and the obtained inverse text frequency index;
the sorting module 35 is configured to sort the target segmented words according to the evaluation values of the target segmented words, and select the target segmented words with a preset threshold from front to back as a secondary course label;
and the system generation module 36 is configured to classify the second-level course tags into preset first-level course tags in a clustering manner, so as to obtain a target course tag system of the target course.
Optionally, the data acquisition module 31 includes:
the weight determining unit is used for determining the floor weight of each comment interaction floor in a link analysis mode;
the floor determining unit is used for determining a target floor according to the weight of each floor and a preset weight threshold;
the floor sequencing unit is used for calculating the ranking value of each target floor based on a preset ranking strategy, and sequencing the target floors according to the sequence of the ranking values from large to small to obtain a target floor queue;
and the content capturing unit is used for capturing the content in the target floor based on the target floor queue to obtain the initial sentence.
Optionally, the preprocessing module 32 includes:
the text conversion unit is used for carrying out case unification and traditional body conversion on the initial sentence to obtain a standard text;
and the useless word extracting and labeling unit is used for extracting and labeling the useless words of the standard text to obtain the labeled processing sentences.
Optionally, the word segmentation module 33 includes:
the word segmentation analysis unit is used for carrying out word segmentation analysis on the processed sentences to obtain M word segmentation sequences;
the probability calculation unit is used for calculating the occurrence probability of each word segmentation sequence according to word sequence data of a preset training corpus aiming at each word segmentation sequence to obtain the occurrence probability of M word segmentation sequences;
and the word segmentation determining unit is used for selecting the word segmentation sequence corresponding to the occurrence probability reaching a preset probability threshold from the occurrence probabilities of the M word segmentation sequences as a target word segmentation sequence, and taking each word segmentation in the target word segmentation sequence as a target word segmentation contained in the processing sentence.
Optionally, the device for generating a course label further comprises:
the vector construction module is used for constructing a basic word vector of each target word segmentation based on a preset corpus;
the vector grouping module is used for calculating the space distance between each basic word vector and each other basic word vector aiming at each basic word vector, and taking two basic word vectors corresponding to each space distance as a group of word vectors;
the near-meaning word determining module is used for determining a group of word vectors corresponding to the space distance as near-meaning word vectors if the space distance is smaller than a preset distance threshold, and acquiring two target word segments corresponding to the near-meaning word vectors as a group of near-meaning words;
and the word segmentation merging module is used for merging each group of near-meaning words to obtain updated target word segmentation.
Optionally, the system generation module 36 includes:
the vector conversion unit is used for carrying out word vector conversion on the preset first-level course labels and taking each obtained word vector as a clustering center;
the distance calculation unit is used for calculating the Euclidean distance from the word vector corresponding to the secondary course label to each clustering center as the space distance of the word vector corresponding to the secondary course label;
the clustering center determining unit is used for acquiring a clustering center corresponding to the spatial distance with the minimum numerical value as a target clustering center for each secondary course label, and taking a preset primary course label corresponding to the target clustering center as a target category;
and the target course label system generating unit is used for classifying the secondary course labels into target categories corresponding to the secondary course labels according to each secondary course label to obtain a target course label system.
For the specific definition of the device for generating the course label, reference may be made to the above definition of the method for generating the course label, and details are not described herein again. The modules in the above-mentioned curriculum label generating device can be wholly or partially implemented by software, hardware and their combination. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. Note that only a computer device 4 having the memory 41, processor 42, and network interface 43 is shown, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing the operating system installed in the computer device 4 and various types of application software, such as program code for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer readable storage medium, wherein the computer readable storage medium stores an interface display program, and the interface display program can be executed by at least one processor, so as to cause the at least one processor to execute the steps of the method for generating course labels as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) that includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods described in the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative and not restrictive, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments, or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present application.

Claims (9)

1. A method for generating a course label, comprising:
collecting interactive comment data of a target course to obtain initial sentences;
performing text preprocessing on the initial sentences to obtain processed sentences;
performing word segmentation on the processed sentences using a preset word segmentation method to obtain target word segments;
constructing a basic word vector for each target word segment based on a preset corpus;
for each basic word vector, calculating the spatial distance between it and every other basic word vector, and treating the two basic word vectors corresponding to each spatial distance as a word vector pair;
if a spatial distance is smaller than a preset distance threshold, determining the word vector pair corresponding to that spatial distance to be near-synonym vectors, and obtaining the two target word segments corresponding to the near-synonym vectors as a near-synonym pair;
merging each near-synonym pair to obtain updated target word segments;
for each target word segment, calculating its term frequency and inverse document frequency based on the TF-IDF algorithm, and determining an evaluation value of the target word segment from the obtained term frequency and inverse document frequency;
ranking the target word segments by their evaluation values, and selecting a preset number of top-ranked target word segments as secondary course labels;
and classifying the secondary course labels under preset primary course labels by clustering to obtain a target course label system of the target course.
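The core scoring steps of claim 1 — merging near-synonyms whose word vectors fall within a distance threshold, then ranking the merged word segments by TF-IDF — can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, toy data, and the use of Euclidean distance are choices made for the sketch.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each word segment in each document by TF-IDF (claim 1's evaluation value)."""
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n / df[w]) for w, c in tf.items()})
    return scores

def merge_near_synonyms(vectors, threshold):
    """Map each word to a representative when their vectors lie within `threshold`."""
    words = list(vectors)
    merged = {}  # word -> representative word it is merged into
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if math.dist(vectors[a], vectors[b]) < threshold:
                merged[b] = merged.get(a, a)
    return merged
```

After merging, the TF-IDF scores of merged segments would be recomputed over the updated documents, and the top-scoring segments kept as secondary course labels.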
2. The method for generating a course label as claimed in claim 1, wherein said collecting interactive comment data of the target course to obtain initial sentences comprises:
determining the floor weight of each comment-interaction floor by link analysis;
determining target floors according to each floor weight and a preset weight threshold;
calculating a ranking value for each target floor based on a preset ranking strategy, and sorting the target floors in descending order of ranking value to obtain a target floor queue;
and crawling the content of the target floors based on the target floor queue to obtain the initial sentences.
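Claim 2 leaves the "link analysis" method unspecified; one common choice is a PageRank-style iteration over reply links between comment floors, so that floors quoted or replied to by many others receive higher weight. The sketch below assumes such a reply-link graph; the function name, damping factor, and data are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iters=50):
    """links: {floor: [floors it replies to]}; returns a weight per floor."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling floor: spread its weight evenly over all floors
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank
```

Floors whose weight exceeds the preset threshold would then be kept as target floors and sorted by the ranking strategy.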
3. The method for generating a course label as claimed in claim 1, wherein said performing text preprocessing on the initial sentences to obtain processed sentences comprises:
unifying the letter case of the initial sentences and converting traditional Chinese characters to simplified ones to obtain standard text;
and performing stop-word extraction and labeling on the standard text to obtain labeled processed sentences.
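A minimal sketch of the preprocessing in claim 3: case unification plus traditional-to-simplified character conversion, followed by stop-word removal. The character table and stop-word list here are toy assumptions; a real system would use a full conversion table (for example, the OpenCC project's) and a curated stop-word list.

```python
# Illustrative only: a real traditional-to-simplified table has thousands of entries.
TRAD_TO_SIMP = {"課": "课", "學": "学", "習": "习"}
STOP_WORDS = {"的", "了", "啊", "the", "a"}

def preprocess(sentence):
    text = sentence.lower()                                   # case unification
    return "".join(TRAD_TO_SIMP.get(c, c) for c in text)      # script conversion

def remove_stop_words(tokens):
    """Drop stop words after segmentation; a labeling step could tag them instead."""
    return [t for t in tokens if t not in STOP_WORDS]
```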
4. The method for generating a course label as claimed in claim 1, wherein a preset training corpus is obtained and analyzed with an N-gram model to obtain word sequence data of the preset training corpus;
and wherein said performing word segmentation on the processed sentences using a preset word segmentation method to obtain target word segments comprises:
performing word segmentation analysis on the processed sentences to obtain M candidate word segmentation sequences;
for each word segmentation sequence, calculating its occurrence probability from the word sequence data of the preset training corpus to obtain the occurrence probabilities of the M word segmentation sequences;
and selecting, from the M occurrence probabilities, the word segmentation sequence whose occurrence probability reaches a preset probability threshold as the target word segmentation sequence, and taking each word segment in the target word segmentation sequence as a target word segment contained in the processed sentence.
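The N-gram scoring of claim 4 can be illustrated with a bigram model: each candidate segmentation is assigned a probability under counts from the pre-segmented training corpus, and the candidate reaching the threshold (in practice, usually the argmax) becomes the target segmentation. The add-one smoothing and toy corpus below are assumptions for the sketch.

```python
import math
from collections import Counter

def bigram_model(corpus_sentences):
    """Count unigrams and bigrams from a pre-segmented training corpus."""
    uni, bi = Counter(), Counter()
    for sent in corpus_sentences:
        padded = ["<s>"] + sent          # sentence-start marker
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def sequence_log_prob(seq, uni, bi, vocab_size):
    """Log-probability of one candidate segmentation, with add-one smoothing."""
    padded = ["<s>"] + seq
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )
```

Given M candidates, the one with the highest `sequence_log_prob` would be selected as the target word segmentation sequence.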
5. The method as claimed in any one of claims 1 to 4, wherein said classifying the secondary course labels under preset primary course labels by clustering to obtain the target course label system of the target course comprises:
performing word vector conversion on the preset primary course labels, and taking each resulting word vector as a cluster center;
calculating the Euclidean distance from the word vector corresponding to each secondary course label to each cluster center as the spatial distance of that word vector;
for each secondary course label, obtaining the cluster center corresponding to the smallest spatial distance as the target cluster center, and taking the preset primary course label corresponding to the target cluster center as the target category;
and classifying each secondary course label into its corresponding target category to obtain the target course label system.
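The clustering in claim 5 reduces to nearest-centroid assignment: each secondary label's word vector is attached to the closest primary-label vector by Euclidean distance. A minimal sketch with illustrative toy vectors (the label names and coordinates are assumptions):

```python
import math

def assign_labels(secondary_vecs, primary_vecs):
    """Assign each secondary label to the nearest primary-label cluster center."""
    system = {p: [] for p in primary_vecs}
    for label, vec in secondary_vecs.items():
        nearest = min(primary_vecs, key=lambda p: math.dist(vec, primary_vecs[p]))
        system[nearest].append(label)
    return system
```

The returned mapping of primary labels to their secondary labels is the target course label system.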
6. An apparatus for generating a course label, comprising:
a data acquisition module for collecting interactive comment data of a target course to obtain initial sentences;
a preprocessing module for performing text preprocessing on the initial sentences to obtain processed sentences;
a word segmentation module for performing word segmentation on the processed sentences using a preset word segmentation method to obtain target word segments;
a vector construction module for constructing a basic word vector for each target word segment based on a preset corpus;
a vector grouping module for calculating, for each basic word vector, the spatial distance between it and every other basic word vector, and treating the two basic word vectors corresponding to each spatial distance as a word vector pair;
a near-synonym determining module for determining, if a spatial distance is smaller than a preset distance threshold, the word vector pair corresponding to that spatial distance to be near-synonym vectors, and obtaining the two target word segments corresponding to the near-synonym vectors as a near-synonym pair;
a word segment merging module for merging each near-synonym pair to obtain updated target word segments;
an evaluation module for calculating, for each target word segment, its term frequency and inverse document frequency based on the TF-IDF algorithm, and determining an evaluation value of the target word segment from the term frequency and inverse document frequency;
a ranking module for ranking the target word segments by their evaluation values and selecting a preset number of top-ranked target word segments as secondary course labels;
and a system generation module for classifying the secondary course labels under preset primary course labels by clustering to obtain a target course label system of the target course.
7. The course label generation apparatus of claim 6, wherein the data acquisition module comprises:
a weight determining unit for determining the floor weight of each comment-interaction floor by link analysis;
a floor determining unit for determining target floors according to each floor weight and a preset weight threshold;
a floor sorting unit for calculating a ranking value for each target floor based on a preset ranking strategy and sorting the target floors in descending order of ranking value to obtain a target floor queue;
and a content crawling unit for crawling the content of the target floors based on the target floor queue to obtain the initial sentences;
and wherein the preprocessing module comprises:
a text conversion unit for unifying the letter case of the initial sentences and converting traditional Chinese characters to simplified ones to obtain standard text;
and a stop-word extraction and labeling unit for performing stop-word extraction and labeling on the standard text to obtain labeled processed sentences.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the course label generation method as claimed in any one of claims 1 to 5 when executing the computer program.
9. A computer-readable storage medium in which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the course label generation method as claimed in any one of claims 1 to 5.
CN202110078984.5A 2021-01-21 2021-01-21 Course label generation method and device, computer equipment and medium Active CN112395421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110078984.5A CN112395421B (en) 2021-01-21 2021-01-21 Course label generation method and device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN112395421A (en) 2021-02-23
CN112395421B (en) 2021-05-11

Family

ID=74625108

Country Status (1)

Country Link
CN (1) CN112395421B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592534A (en) * 2021-06-30 2021-11-02 深圳市东信时代信息技术有限公司 Method and device for determining competitive product words, computer equipment and storage medium
CN114443850B (en) * 2022-04-06 2022-07-22 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on semantic similar model
CN115293158B (en) * 2022-06-30 2024-02-02 撼地数智(重庆)科技有限公司 Label-assisted disambiguation method and device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107301199A (en) * 2017-05-17 2017-10-27 北京融数云途科技有限公司 A kind of data label generation method and device
CN108510307A (en) * 2018-02-25 2018-09-07 心触动(武汉)科技有限公司 A kind of course recommendation method and system
CN110263854A (en) * 2019-06-20 2019-09-20 广州酷狗计算机科技有限公司 Live streaming label determines method, apparatus and storage medium
WO2020226771A1 (en) * 2019-05-03 2020-11-12 Microsoft Technology Licensing, Llc Characterizing failures of a machine learning model based on instance features

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN106033415B (en) * 2015-03-09 2020-07-03 深圳市腾讯计算机系统有限公司 Text content recommendation method and device
CN111353071A (en) * 2018-12-05 2020-06-30 阿里巴巴集团控股有限公司 Label generation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant