CN117574892A - Text position analysis method, device, equipment and storage medium - Google Patents

Text position analysis method, device, equipment and storage medium Download PDF

Info

Publication number
CN117574892A
Authority
CN
China
Prior art keywords
text
phrase
analyzed
word
capsule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311492828.9A
Other languages
Chinese (zh)
Inventor
张传新
张旭
张翔宇
何扬
陈彤
解峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center filed Critical National Computer Network and Information Security Management Center
Priority to CN202311492828.9A priority Critical patent/CN117574892A/en
Publication of CN117574892A publication Critical patent/CN117574892A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a text position analysis method, apparatus, device, and storage medium. The method comprises: acquiring a text to be analyzed and a topic phrase corresponding to the text to be analyzed; extracting features of the text to be analyzed and the topic phrase to obtain a text vector and a topic vector; and inputting the text vector and the topic vector into a pre-trained layered capsule model to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase, wherein the layered capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed. By providing the layered capsule model, the method and the device can extract feature information of the text at different levels, mine deep information of the text, and improve the accuracy of position judgment on the text.

Description

Text position analysis method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text position analysis method, device, apparatus, and storage medium.
Background
Text position analysis is an important technology for studying public opinion trends on social media and assisting business decisions. It aims to analyze the attitude (support, opposition, or neutrality) expressed in a piece of text toward a specified target or topic.
Conventional text position analysis methods generally rely on sentiment analysis: they detect emotion words in a text, such as happiness, sadness and anger, and combine them with other text features to infer the position. However, for text containing rhetorical devices such as irony and metaphor, or complex emotional expressions, such methods cannot fully mine deep representation information, which easily leads to positions being misjudged or missed.
Disclosure of Invention
In order to solve the technical problems, the disclosure provides a text position analysis method, a text position analysis device, text position analysis equipment and a storage medium.
A first aspect of an embodiment of the present disclosure provides a text position analysis method, the method including:
acquiring a text to be analyzed and a topic phrase corresponding to the text to be analyzed;
extracting features of the text to be analyzed and the topic phrase to obtain a text vector and a topic vector;
inputting the text vector and the topic vector into a pre-trained layered capsule model to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase, wherein the layered capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed.
A second aspect of embodiments of the present disclosure provides a text position analysis apparatus, the apparatus comprising:
the acquisition module is used for acquiring the text to be analyzed and the topic phrase corresponding to the text to be analyzed;
the extraction module is used for extracting the characteristics of the text to be analyzed and the topic phrase to obtain a text vector and a topic vector;
the analysis module is used for inputting the text vector and the topic vector into a pre-trained layered capsule model to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase, wherein the layered capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed.
A third aspect of the disclosed embodiments provides a computer device comprising a memory and a processor, and a computer program, wherein the memory stores the computer program, which when executed by the processor, implements the text position analysis method as described above for the first aspect.
A fourth aspect of embodiments of the present disclosure provides a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements a text position analysis method as in the first aspect described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the text position analysis method, device and equipment and storage medium, the text to be analyzed and the topic phrases corresponding to the text to be analyzed are obtained, feature extraction is carried out on the text to be analyzed and the topic phrases to obtain text vectors and topic vectors, the text vectors and the topic vectors are input into a pre-trained layered capsule model to obtain analysis results which are output by the layered capsule model and are analyzed according to the topic phrases in the position of the text to be analyzed, the layered capsule model comprises word capsule layers, sentence capsule layers and category capsule layers, the word capsule layers, the sentence capsule layers and the category capsule layers are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed, feature information of different levels can be extracted from the text to be analyzed through setting the layered capsule model, and further deep representation information of the text is dug.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a text position analysis method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of determining a position analysis result provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of determining a subject phrase provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of text preprocessing provided by an embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of screening phrase collections provided by an embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of calculating a relevancy score provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a text position analysis device according to an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
Fig. 1 is a flow chart of a text position analysis method provided by an embodiment of the present disclosure, which may be performed by a text position analysis apparatus. As shown in fig. 1, the text position analysis method provided in this embodiment includes the following steps:
S101, acquiring a text to be analyzed and a topic phrase corresponding to the text to be analyzed.
The topic phrase in the embodiments of the present disclosure may be understood as a multi-word phrase extracted from the text to be analyzed that can reflect the topic of the text to be analyzed.
In the embodiment of the disclosure, the text position analysis device can acquire the text to be analyzed which needs position analysis and the topic phrase which can reflect the topic of the text to be analyzed.
In an exemplary implementation manner of the embodiment of the present disclosure, the text position analysis device may obtain a text to be analyzed uploaded by a user, and obtain a topic phrase, corresponding to the text to be analyzed, input by the user.
In another exemplary implementation of the disclosed embodiments, the text position analysis device may extract the topic phrase from the text to be analyzed after obtaining the text to be analyzed uploaded by the user.
S102, extracting features of the text to be analyzed and the topic phrase to obtain a text vector and a topic vector.
In the embodiment of the disclosure, the text position analysis device may perform feature extraction on the text to be analyzed and the topic phrase after obtaining the text to be analyzed and the topic phrase corresponding to the text to be analyzed, so as to obtain a text vector corresponding to the text to be analyzed and a topic vector corresponding to the topic phrase.
In one exemplary implementation of the disclosed embodiments, the text position analysis device may input the text to be analyzed and the topic phrase into a pre-trained feature extraction model, for example a Bidirectional Encoder Representations from Transformers model pre-trained on Chinese (BERT Base Chinese), and encode the text to be analyzed and the topic phrase with this model to obtain the text vector and the topic vector.
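For illustration only, the encoding step described above can be sketched as follows. The use of the Hugging Face transformers library, the bert-base-chinese checkpoint, and mean pooling of the topic phrase into a single vector are assumptions made for this sketch, not details disclosed by the method itself:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
encoder.eval()

def encode(text: str, pool: bool = False) -> torch.Tensor:
    # Returns one vector per token, or a single mean-pooled vector if pool=True.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state.squeeze(0)  # (seq_len, hidden_size)
    return hidden.mean(dim=0) if pool else hidden

text_vectors = encode("待分析文本……")            # token-level text vectors x_i
topic_vector = encode("主题短语", pool=True)      # pooled topic vector t (pooling choice is an assumption)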
S103, inputting the text vector and the topic vector into a pre-trained layered capsule model to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase, wherein the layered capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed.
The layered capsule model in the embodiment of the disclosure may be understood as a model capable of extracting text features at different levels and finally determining the position of the text. The layered capsule model is composed of three capsule layers, namely a word capsule layer, a sentence capsule layer and a category capsule layer. Each capsule layer contains a plurality of capsules for learning the text features at the granularity corresponding to that layer: the numbers of word capsules and sentence capsules are equal to the numbers of words and sentences contained in the text to be analyzed respectively, and the number of category capsules is equal to the number of position categories, where the position categories may include support, opposition and neutrality. The layered capsule model is trained as follows: training data is prepared, comprising a large number of training texts and corresponding position labels obtained by manual labeling; the layered capsule network is trained in a supervised manner on the training data; a loss value is calculated based on the output result and the corresponding position label; and the network parameters are adjusted continuously until the loss converges, at which point the layered capsule model can be determined to be trained.
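A minimal sketch of such a supervised training loop is given below. The model class, the data loader and the use of cross-entropy as the loss function are illustrative assumptions; the description above only states that a loss is computed from the model output and the manually labeled position labels and minimized until convergence:

import torch
from torch import nn, optim

def train_hierarchical_capsule_model(model, train_loader, num_epochs: int = 10, lr: float = 1e-4):
    # model(text_vec, topic_vec) is assumed to return per-category confidence scores;
    # stance_label is assumed to be 0/1/2 for support/opposition/neutrality.
    criterion = nn.CrossEntropyLoss()      # assumed loss; the description does not name one
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for text_vec, topic_vec, stance_label in train_loader:
            logits = model(text_vec, topic_vec)
            loss = criterion(logits, stance_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model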
In the embodiment of the disclosure, the text position analysis device may input the text vector and the topic vector into a pre-trained layered capsule model after obtaining the text vector and the topic vector, so as to obtain an analysis result of position analysis of the text to be analyzed according to the topic phrase output by the layered capsule model.
According to the embodiment of the disclosure, the text to be analyzed and the topic phrase corresponding to the text to be analyzed are acquired, feature extraction is performed on them to obtain a text vector and a topic vector, and the text vector and the topic vector are input into a pre-trained layered capsule model to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase. The layered capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed. By providing the layered capsule model, feature information of different levels can be extracted from the text to be analyzed, and deep representation information of the text can thus be mined. Compared with the traditional approach of judging an author's position by identifying emotion words, word features can be identified at the bottom layer, combined into phrase features at the middle layer, and integrated into sentence semantics at the top layer, so that multi-granularity text features are grasped as a whole and the accuracy of position judgment is improved.
Fig. 2 is a flowchart of a method for determining a position analysis result according to an embodiment of the present disclosure, and as shown in fig. 2, the position analysis result may be determined as follows based on the above-described embodiment.
S201, inputting the text vector and the topic vector into a word capsule layer of the hierarchical capsule model to obtain word-level features output by the word capsule layer.
In the embodiment of the disclosure, after obtaining the text vector and the topic vector, the text position analysis device may concatenate the text vector and the topic vector, input the concatenation result into the word capsule layer of the hierarchical capsule model, extract word-level features with each word capsule contained in the word capsule layer, and output all the extracted word-level features in combination. This process can be expressed as follows:
u_i = Capsule(h_{i,j}) ∈ R^{k×d}
where i denotes the index of the word in the text, j denotes the index of the capsule in the word capsule layer, h_i = [x_i : t_i], x_i denotes the text vector, t_i denotes the topic vector, h_i denotes the concatenation of the text vector and the topic vector, h_{i,j} is the representation of the i-th word in the word capsule layer, u_i denotes the word-level features extracted by the i-th word capsule, Capsule(·) denotes the capsule function, k denotes the number of word capsules, and d denotes the vector dimension of each word capsule, d being a preset value.
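An illustrative sketch of such a word capsule layer is given below. Realizing Capsule(·) as a linear projection into k capsules of dimension d is an assumption; the description only specifies the input concatenation h_i = [x_i : t_i] and the output shape:

import torch
from torch import nn

class WordCapsuleLayer(nn.Module):
    # Maps each word representation h_i = [x_i : t_i] to k word capsules of dimension d.
    def __init__(self, hidden_size: int, k: int, d: int):
        super().__init__()
        self.k, self.d = k, d
        self.proj = nn.Linear(2 * hidden_size, k * d)   # assumed realization of Capsule(.)

    def forward(self, text_vecs: torch.Tensor, topic_vec: torch.Tensor) -> torch.Tensor:
        # text_vecs: (num_words, hidden_size); topic_vec: (hidden_size,)
        topic = topic_vec.expand(text_vecs.size(0), -1)
        h = torch.cat([text_vecs, topic], dim=-1)       # h_i = [x_i : t_i]
        u = self.proj(h).view(-1, self.k, self.d)       # u_i in R^{k x d}
        return u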
S202, inputting the word-level features into the sentence capsule layer of the layered capsule model, and determining the sentence-level features by combining the word-level features with the first weights acquired in advance from the word capsule layer.
The first weight in the embodiment of the disclosure may be understood as the weight with which a word capsule passes its output to a sentence capsule, determined in advance by the inter-capsule dynamic routing algorithm of the capsule network. The dynamic routing algorithm is mainly used for information transmission: it determines the paths and weights of information transmission by calculating the similarity between the capsules involved, so that the relationships between capsules are better captured.
In the embodiment of the disclosure, after obtaining the word-level features output by the word capsule layer, the text position analysis device may input the word-level features into the sentence capsule layer of the layered capsule model, extract sentence-level features with each sentence capsule of the sentence capsule layer based on the word-level features and the first weight corresponding to each word-level feature, and output all the extracted sentence-level features in combination. This process can be expressed as follows:
v_i = Σ_j c_{i,j} · u_{i,j}
wherein v_i denotes the sentence-level features extracted by the i-th sentence capsule, u_{i,j} denotes the word-level features extracted by a word capsule, specifically the word-level features of the i-th word in the j-th word capsule, and c_{i,j} denotes the first weight with which the word capsule corresponding to u_{i,j} passes them to the sentence capsule corresponding to v_i.
S203, inputting the sentence-level features into a category capsule layer of the layered capsule model, and determining category features based on the category capsule layer by combining the sentence-level features and the second weights acquired in advance.
The second weight in the embodiments of the present disclosure may be understood as the weight with which a sentence capsule passes its output to a category capsule, determined in advance by the inter-capsule dynamic routing algorithm of the capsule network.
In the embodiment of the disclosure, after obtaining the sentence-level features output by the sentence capsule layer, the text position analysis device may input the sentence-level features into the category capsule layer of the layered capsule model, extract category features with each category capsule of the category capsule layer based on the sentence-level features and the second weight corresponding to each sentence-level feature, and output all the extracted category features in combination. This process can be expressed as follows:
s_i = Σ_j b_{i,j} · v_{i,j}
wherein s_i denotes the category features extracted by the i-th category capsule, v_{i,j} denotes the sentence-level features extracted by a sentence capsule, specifically the sentence-level features of the i-th sentence in the j-th sentence capsule, and b_{i,j} denotes the second weight with which the sentence capsule corresponding to v_{i,j} passes them to the category capsule corresponding to s_i.
In one exemplary implementation of the disclosed embodiments, the first weight c_{i,j} and the second weight b_{i,j} may be determined from auxiliary routing parameters, wherein a_{i,j,l} is an auxiliary parameter representing the weight with which the word capsule u_{i,j} is passed to the sentence capsule v_{i,l}, and is initialized to 0; L denotes the number of category capsules; and · denotes the vector dot product. During the training process of the layered capsule model, the first weight c_{i,j}, the second weight b_{i,j} and the auxiliary parameter a_{i,j,l} are iteratively adjusted until the model converges, and the c_{i,j} and b_{i,j} extracted from the converged layered capsule model are determined as the final first weight and second weight.
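For illustration, a routing-by-agreement step consistent with the description above can be sketched as follows. The softmax normalization of the auxiliary parameters, the squash nonlinearity and the three routing iterations are conventional capsule-network choices assumed here; the description only states that the auxiliary parameters are initialized to 0 and that the vector dot product is used:

import torch
import torch.nn.functional as F

def squash(v: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Standard capsule-network squash nonlinearity (an assumption; the description does not name it).
    norm_sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat: torch.Tensor, num_iterations: int = 3) -> torch.Tensor:
    # u_hat: (num_lower_capsules, num_upper_capsules, d) predictions from the lower layer.
    a = torch.zeros(u_hat.size(0), u_hat.size(1))        # auxiliary parameters, initialized to 0
    for _ in range(num_iterations):
        c = F.softmax(a, dim=1)                          # routing weights (first/second weights)
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)         # weighted sum over lower capsules
        v = squash(s)                                    # upper-layer capsule outputs
        a = a + (u_hat * v.unsqueeze(0)).sum(dim=-1)     # dot-product agreement update
    return v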
S204, determining the position category corresponding to the category features, and determining the position category as the analysis result of position analysis of the text to be analyzed according to the topic phrase.
In the embodiment of the disclosure, after obtaining the category features output by the category capsule layer, the text position analysis device may determine the corresponding position category according to the category features. Specifically, the category features output by the category capsule layer take the form of a feature vector with three dimensions, each dimension corresponding to one position category; the value of each dimension ranges from 0 to 1 and represents the confidence of the corresponding position category. The position category corresponding to the dimension with the highest confidence is the finally determined position category, and this position category is determined as the analysis result of position analysis of the text to be analyzed according to the topic phrase. For example, the category features may be (0.4, 0.3, 0.8); if the three dimensions correspond to support, opposition and neutrality in sequence, the position category corresponding to these category features is neutrality.
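The mapping from the three-dimensional category features to a position label can be sketched as follows; the label order (support, opposition, neutrality) is taken from the example above:

def stance_from_category_feature(category_feature):
    # category_feature: per-dimension confidences in [0, 1], ordered (support, oppose, neutral).
    labels = ("support", "oppose", "neutral")
    best = max(range(len(category_feature)), key=lambda i: category_feature[i])
    return labels[best]

print(stance_from_category_feature((0.4, 0.3, 0.8)))   # -> "neutral"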
According to the embodiment of the disclosure, the text vector and the topic vector are input into the word capsule layer of the layered capsule model to obtain the word-level features output by the word capsule layer; the word-level features are input into the sentence capsule layer of the layered capsule model, and the sentence-level features are determined by combining the word-level features with the first weights acquired in advance; the sentence-level features are input into the category capsule layer of the layered capsule model, and the category features are determined by combining the sentence-level features with the second weights acquired in advance; and the position category corresponding to the category features is determined and taken as the analysis result of position analysis of the text to be analyzed according to the topic phrase. In this way, word-level, sentence-level and category features can be extracted, and the position category is then determined according to the category features, so that the finally determined position category comprehensively considers text features of different levels and the accuracy of position analysis is improved.
Fig. 3 is a flowchart of a method for determining a subject phrase according to an embodiment of the present disclosure, and as shown in fig. 3, the subject phrase may be determined by the following method on the basis of the above-described embodiment.
S301, preprocessing the text to be analyzed to obtain a set of phrases contained in the text to be analyzed.
In the embodiment of the disclosure, the text position analysis device may preprocess the text to be analyzed after obtaining the text to be analyzed, determine each phrase contained in the text to be analyzed, and create a set for the phrases.
In an exemplary implementation manner of the embodiment of the disclosure, the text position analysis device may identify the attribute of each word in the text, and extract phrases from the text to be analyzed according to a preset attribute structure of a phrase to form the set.
S302, determining, for each phrase contained in the phrase set, the relevance score of the phrase and a preset topic and the importance score of the phrase, and calculating the product of the relevance score and the importance score.
Preset topics in the embodiments of the present disclosure may be understood as one or more topics that are set in advance.
Importance scores in embodiments of the present disclosure may be understood as scores that characterize the importance of a phrase in the text to be analyzed, taking the importance of the phrase in other text as a baseline reference.
In the embodiment of the disclosure, after determining the phrase set included in the text to be analyzed, the text position analysis device may calculate, for each phrase in the phrase set, a relevance score of the phrase and a preset topic, and an importance score of the phrase, and calculate a product of the relevance score and the importance score.
In an exemplary implementation manner of the embodiment of the present disclosure, the text position analysis device may input the phrase and the preset topic into a pre-trained association model to obtain the relevance score output by the model, and input the phrase into a pre-trained importance model to obtain the importance score output by the model.
S303, determining the phrase with the largest corresponding product as the topic phrase.
In the embodiment of the disclosure, after calculating the product of the relevance score and the importance score for each phrase, the text position analysis device may select the largest product and determine the phrase corresponding to this product as the topic phrase.
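As a simple illustration, the selection step can be written as follows; the two score functions are placeholders for the relevance and importance scores described in this disclosure (their computation is sketched further below):

def select_topic_phrase(phrases, relevance_score, importance_score):
    # phrases: iterable of candidate phrases; the callables return the two scores per phrase.
    return max(phrases, key=lambda g: relevance_score(g) * importance_score(g))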
According to the embodiment of the disclosure, the text to be analyzed is preprocessed to obtain the set of phrases contained in the text to be analyzed; for each phrase contained in the set, the relevance score of the phrase and the preset topic and the importance score of the phrase are determined, and the product of the relevance score and the importance score is calculated; and the phrase with the largest product is determined as the topic phrase. In this way, the topic phrase with the highest score can be determined comprehensively from two angles, namely the relevance to the topic and the importance of the phrase in the text to be analyzed, so that the topic phrase is determined automatically and the accuracy of text position analysis is further improved.
Fig. 4 is a flow chart of a method of text preprocessing provided by an embodiment of the present disclosure. As shown in fig. 4, on the basis of the above-described embodiment, text preprocessing may be performed as follows.
S401, word segmentation processing is carried out on the text to be analyzed.
In the embodiment of the disclosure, after obtaining the text to be analyzed, the text position analysis device may perform word segmentation on the text to be analyzed. Specifically, various methods such as shortest-path word segmentation, n-gram word segmentation, word segmentation by word formation, recurrent neural network word segmentation, and Transformer-based word segmentation may be used, which is not limited herein.
S402, determining a set of phrase compositions containing a preset number of continuous words based on word segmentation results.
The preset number in the embodiment of the disclosure may be understood as the number of words contained in the phrase, and the value of the preset number may be set by the user, or may be a fixed value set empirically, which is not limited herein.
In the embodiment of the disclosure, the text position analysis device can determine a preset number of continuous words as a phrase after the word segmentation result is obtained, and create a set for the phrase.
In an exemplary implementation manner of the embodiment of the present disclosure, the text position analysis device may determine phrases comprising a preset number of consecutive words using an n-gram model. The n-gram model slides a window of n words over the content of the text to form a sequence of word segments of length n, where each word segment is called a gram and corresponds to one phrase.
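A minimal sketch of this candidate-phrase extraction is given below; the use of the jieba word-segmentation library is an assumption for illustration, and any of the segmentation methods mentioned above could be substituted:

import jieba   # common Chinese word-segmentation library (its use here is an assumption)

def candidate_phrases(text: str, n: int):
    # Segment the text into words and slide a window of n consecutive words over it;
    # each window forms one candidate phrase (an n-gram).
    words = list(jieba.cut(text))
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]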
According to the embodiment of the disclosure, word segmentation is carried out on the text to be analyzed, and the set composed of phrases containing the preset number of consecutive words is determined based on the word segmentation result, so that the phrases containing the preset number of words can be determined automatically, which facilitates the subsequent screening of the topic phrase.
Fig. 5 is a flowchart of a method for screening phrase sets according to an embodiment of the present disclosure, where the phrase sets may be screened as follows, as shown in fig. 5, on the basis of the above embodiment.
S501, word frequency statistics is carried out based on word segmentation results.
In the embodiment of the disclosure, the text position analysis device may perform word frequency statistics on each word based on the word segmentation result after obtaining the word segmentation result of the text to be analyzed, so as to determine the occurrence number of each word in the text to be analyzed.
S502, determining a low-frequency vocabulary with the word frequency lower than a preset threshold value based on the word frequency statistical result.
The preset threshold in the embodiment of the present disclosure may be understood as a threshold, set in advance, on the word frequency, below which a word is determined to be a low-frequency word.
In the embodiment of the disclosure, the text position analysis device may determine, after obtaining the word frequency statistics result, a word with a word frequency lower than a preset threshold as a low-frequency word according to a word frequency of each word.
S503, eliminating phrases containing low-frequency words and/or punctuation marks from the phrases containing a preset number of consecutive words, and forming a set based on the remaining phrases.
In the embodiment of the disclosure, the text position analysis device may reject, after determining the phrases including the preset number of continuous words, the phrases including at least one of the low-frequency words and punctuation marks according to the words that constitute each phrase, and add the remaining phrases that do not include the low-frequency words and the punctuation marks to the created phrase set.
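An illustrative sketch of this filtering step follows; the frequency threshold value and the punctuation set used here are assumptions:

from collections import Counter
import string

def filter_phrases(words, phrases, freq_threshold: int = 2):
    # words: the segmented text; phrases: candidate n-grams from the previous step.
    # Remove any phrase containing a low-frequency word or a punctuation mark.
    counts = Counter(words)
    punctuation = set(string.punctuation) | set("，。！？；：“”‘’（）、《》")
    low_freq = {w for w, c in counts.items() if c < freq_threshold}
    return {g for g in phrases
            if not any(w in low_freq or w in punctuation for w in g)}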
According to the embodiment of the disclosure, word frequency statistics are performed based on the word segmentation result, low-frequency words with a word frequency lower than the preset threshold are determined based on the word frequency statistics result, phrases containing low-frequency words and/or punctuation marks are removed from the phrases containing the preset number of consecutive words, and the set is formed based on the remaining phrases. In this way, phrases that are not suitable for expressing the topic of the text to be analyzed can be removed, so that the matching degree between the topic phrase and the text to be analyzed is improved.
Fig. 6 is a flowchart of a method for calculating a relevance score according to an embodiment of the present disclosure, and as shown in fig. 6, the relevance score may be calculated according to the following method on the basis of the above embodiment.
S601, determining a first association degree of the preset topic and the text to be analyzed based on a pre-trained latent Dirichlet allocation model.
The first degree of association in the embodiments of the present disclosure may be understood as the topic distribution P(z|D) of the text to be analyzed with respect to the preset topic, where z represents the preset topic and D represents the text to be analyzed.
In an embodiment of the disclosure, the text position analysis device may input the preset topic and the text to be analyzed into a pre-trained latent Dirichlet allocation (Latent Dirichlet Allocation, LDA) model, and determine the first degree of association based on the model output. The first degree of association is calculated as follows:
P(z|D) = P(z) · P(D|z) / P(D)
wherein P(z) denotes the probability of occurrence of the topic z in the whole corpus, namely the ratio of the number of occurrences of the topic z in the corpus to the number of all topics contained in the corpus; P(D|z) denotes the probability of generating the text D according to the topic z, which can be obtained by inputting the preset topic and the text to be analyzed into the LDA model; and P(D) denotes the probability of the text D, which is also determined based on the LDA model. The corpus contains the text to be analyzed.
S602, determining a second association degree of the phrase and the preset topic based on the first number of occurrences of the phrase in the preset text corresponding to the preset topic and the sum of the second numbers of occurrences of each phrase in the set in the preset text.
In the embodiment of the disclosure, the LDA model uses each topic and the text corresponding to the topic as training data during training, and the preset topic corresponds to a preset text. The first number of occurrences is the number of occurrences, in the preset text corresponding to the preset topic, of the phrase currently being scored, and the second number of occurrences is the number of occurrences, in that preset text, of each phrase in the phrase set.
In the embodiment of the disclosure, the text position analysis device may obtain the first number of occurrences of the phrase to be scored in the preset text corresponding to the preset topic and the second number of occurrences of each phrase in the set in that preset text, so as to determine the second degree of association between the phrase and the preset topic according to the first number of occurrences and the sum of the second numbers of occurrences. The second degree of association is calculated as follows:
P(g_i|z) = (n_{z,g_i} + β) / (Σ_{j=1}^{V} n_{z,j} + V·β)
wherein P(g_i|z) denotes the second degree of association between the phrase g_i and the preset topic z; n_{z,g_i} denotes the first number of occurrences of the phrase g_i in the preset text corresponding to the preset topic z; β is a smoothing term for preventing the second degree of association from being 0 and usually takes a small value, e.g., β = 0.01; n_{z,j} denotes the second number of occurrences of the j-th phrase in the phrase set in the preset text corresponding to the preset topic z; and V denotes the number of phrases contained in the phrase set.
S603, calculating a relevance score of the phrase and the preset topic based on the first degree of association and the second degree of association.
In the embodiment of the disclosure, after determining the first degree of association between the preset topic and the text to be analyzed and the second degree of association between the phrase and the preset topic, the text position analysis device may calculate the relevance score between the phrase and the preset topic. The specific calculation formula is as follows:
score_topic(g_i) = Σ_{z=1}^{K} P(z|D) · P(g_i|z)
wherein score_topic(g_i) denotes the relevance score between the phrase g_i and the preset topics, and K denotes the number of preset topics.
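For illustration, the relevance score can be computed as in the following sketch, assuming that the document-topic distribution P(z|D) and the per-topic phrase counts n_{z,·} have already been obtained (for example from a trained LDA model and its preset texts); how those inputs are built is not shown here:

def phrase_topic_probability(phrase, topic_phrase_counts, vocab_size, beta=0.01):
    # Smoothed P(g_i | z) = (n_{z,g_i} + beta) / (sum_j n_{z,j} + V * beta).
    n_zg = topic_phrase_counts.get(phrase, 0)
    total = sum(topic_phrase_counts.values())
    return (n_zg + beta) / (total + vocab_size * beta)

def relevance_score(phrase, doc_topic_dist, counts_per_topic, vocab_size, beta=0.01):
    # doc_topic_dist: {topic z: P(z | D)}; counts_per_topic: {topic z: {phrase: n_{z,phrase}}}.
    return sum(p_zd * phrase_topic_probability(phrase, counts_per_topic[z], vocab_size, beta)
               for z, p_zd in doc_topic_dist.items())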
According to the embodiment of the disclosure, the first degree of association between the preset topic and the text to be analyzed is determined based on the pre-trained latent Dirichlet allocation model; the second degree of association between the phrase and the preset topic is determined based on the first number of occurrences of the phrase in the preset text corresponding to the preset topic and the sum of the second numbers of occurrences of each phrase in the set in the preset text; and the relevance score between the phrase and the preset topic is calculated based on the first degree of association and the second degree of association. In this way, the relationship between the topic and the text to be analyzed and the relationship between the topic and the phrase can be considered comprehensively when determining the relevance score between the phrase and the topic, so that the finally determined topic phrase better reflects the topic of the text to be analyzed.
In some embodiments of the present disclosure, the text position analysis device may calculate the importance score of the phrase based on a third number of occurrences of the phrase in the text to be analyzed and a fourth number of occurrences of the phrase in a preset corpus, where the preset corpus contains the text to be analyzed.
Specifically, the text position analysis device may count the number of occurrences of the phrase in the text to be analyzed and determine it as the third number of occurrences, count the number of occurrences of the phrase in a preset corpus containing the text to be analyzed and determine it as the fourth number of occurrences, and, after determining the third number of occurrences and the fourth number of occurrences, calculate the importance score corresponding to the phrase.
In the calculation, score_quality(g_i) denotes the importance score of the phrase g_i; tf(g_i, D) denotes the third number of occurrences of the phrase g_i in the text D to be analyzed, obtained with an indicator function that takes the value 1 when the phrase formed by the j-th to (j+|g_i|-1)-th words of the text to be analyzed equals g_i and 0 otherwise; the fourth number of occurrences is the number of occurrences of g_i in the preset corpus; n denotes the total number of words in the text to be analyzed; and |g_i| denotes the number of words contained in the phrase g_i.
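The occurrence counting described above can be sketched as follows. The final combination of the two counts into a single importance score is shown here only as an illustrative ratio; the exact score_quality formula of this disclosure is not reproduced:

def phrase_occurrences(words, phrase):
    # Sliding indicator from the description: 1 when words j .. j+|g_i|-1 equal the phrase, else 0.
    g = tuple(phrase)
    return sum(1 for j in range(len(words) - len(g) + 1) if tuple(words[j:j + len(g)]) == g)

def importance_score(phrase, doc_words, corpus_words):
    tf_doc = phrase_occurrences(doc_words, phrase)        # third number of occurrences
    tf_corpus = phrase_occurrences(corpus_words, phrase)  # fourth number of occurrences
    return tf_doc / max(tf_corpus, 1)                     # illustrative normalization (assumption)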
Fig. 7 is a schematic structural diagram of a text position analysis device according to an embodiment of the present disclosure. As shown in fig. 7, the text position analysis device 700 comprises an acquisition module 710, an extraction module 720 and an analysis module 730, wherein the acquisition module 710 is used for acquiring a text to be analyzed and a topic phrase corresponding to the text to be analyzed; the extraction module 720 is configured to perform feature extraction on the text to be analyzed and the topic phrase to obtain a text vector and a topic vector; and the analysis module 730 is configured to input the text vector and the topic vector into a pre-trained layered capsule model, so as to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase, where the layered capsule model includes a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed.
Optionally, the analysis module 730 includes: the first extraction unit is used for inputting the text vector and the topic vector into a word capsule layer of the layered capsule model to obtain the word-level features output by the word capsule layer; the second extraction unit is used for inputting the word-level features into a sentence capsule layer of the layered capsule model, and determining the sentence-level features by combining the word-level features with the first weights acquired in advance based on the word capsule layer; a third extraction unit, configured to input the sentence-level features into a category capsule layer of the layered capsule model, and determine the category features based on the category capsule layer by combining the sentence-level features with a second weight acquired in advance; the first determining unit is used for determining the position category corresponding to the category features and determining the position category as an analysis result of position analysis on the text to be analyzed according to the topic phrase.
Optionally, the acquiring module 710 includes: the preprocessing unit is used for preprocessing the text to be analyzed to obtain a set of phrases contained in the text to be analyzed; a calculating unit, configured to determine, for each phrase included in the set of phrases, a relevance score of the phrase and a preset topic, and an importance score of the phrase, and calculate a product of the relevance score and the importance score; and the second determining unit is used for determining the phrase with the largest corresponding product as the theme phrase.
Optionally, the preprocessing unit includes: the word segmentation subunit is used for carrying out word segmentation processing on the text to be analyzed; and the first determination subunit is used for determining a set of phrase compositions containing a preset number of continuous words based on the word segmentation result.
Optionally, the first determining subunit includes: a statistics sub-subunit, used for carrying out word frequency statistics based on the word segmentation result; a determination sub-subunit, used for determining low-frequency words whose word frequency is lower than a preset threshold based on the word frequency statistics result; and an elimination sub-subunit, used for eliminating phrases containing the low-frequency words and/or punctuation marks from the phrases containing the preset number of consecutive words, and forming the set based on the remaining phrases.
Optionally, the computing unit includes: a second determining subunit, used for determining a first degree of association between the preset topic and the text to be analyzed based on a pre-trained latent Dirichlet allocation model; a third determining subunit, configured to determine a second degree of association between the phrase and the preset topic based on a first number of occurrences of the phrase in a preset text corresponding to the preset topic and a sum of second numbers of occurrences, in the preset text, of each phrase included in the set; and a calculating subunit, used for calculating the relevance score between the phrase and the preset topic based on the first degree of association and the second degree of association.
Optionally, the calculating unit is specifically configured to calculate the importance score of the phrase based on the third occurrence number of the phrase in the text to be analyzed and the fourth occurrence number of the phrase in a preset corpus, where the preset corpus includes the text to be analyzed.
The text position analysis device provided in this embodiment can execute the method described in any of the foregoing embodiments, and the execution manner and the beneficial effects thereof are similar, and are not repeated here.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
As shown in fig. 8, the computer device may include a processor 810 and a memory 820 storing computer program instructions.
In particular, the processor 810 described above may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 820 may include mass storage for information or instructions. By way of example, and not limitation, memory 820 may comprise a Hard Disk Drive (HDD), floppy Disk Drive, flash memory, optical Disk, magneto-optical Disk, magnetic tape, or universal serial bus (Universal Serial Bus, USB) Drive, or a combination of two or more of these. Memory 820 may include removable or non-removable (or fixed) media, where appropriate. The memory 820 may be internal or external to the integrated gateway device, where appropriate. In a particular embodiment, the memory 820 is a non-volatile solid state memory. In a particular embodiment, the Memory 820 includes Read-Only Memory (ROM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (Erasable Programmable ROM, EPROM), electrically erasable PROM (Electrically Erasable Programmable ROM, EEPROM), electrically rewritable ROM (Electrically Alterable ROM, EAROM), or flash memory, or a combination of two or more of these, where appropriate.
The processor 810 performs the steps of the text position analysis method provided by the embodiments of the present disclosure by reading and executing computer program instructions stored in the memory 820.
In one example, the computer device may also include a transceiver 830 and a bus 840. As shown in fig. 8, the processor 810, the memory 820 and the transceiver 830 are connected to and communicate with each other through the bus 840.
Bus 840 includes hardware, software, or both. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-X bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 840 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
The present disclosure also provides a computer-readable storage medium, which may store a computer program that, when executed by a processor, causes the processor to implement the text standpoint analysis method provided by the embodiments of the present disclosure.
The storage medium described above may, for example, be the memory 820 storing computer program instructions executable by the processor 810 of the text position analysis device to perform the text position analysis method provided by embodiments of the present disclosure. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, a ROM, a random access memory (Random Access Memory, RAM), a Compact Disc ROM (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like. The computer programs described above may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A text position analysis method, the method comprising:
acquiring a text to be analyzed and a topic phrase corresponding to the text to be analyzed;
extracting features of the text to be analyzed and the topic phrase to obtain a text vector and a topic vector;
inputting the text vector and the topic vector into a pre-trained layered capsule model to obtain an analysis result, output by the layered capsule model, of position analysis of the text to be analyzed according to the topic phrase, wherein the layered capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed.
2. The method according to claim 1, wherein the inputting the text vector and the topic vector into a pre-trained layered capsule model to obtain an analysis result output by the layered capsule model for the position analysis of the text to be analyzed according to the topic phrase includes:
inputting the text vector and the topic vector into a word capsule layer of the layered capsule model to obtain the word-level features output by the word capsule layer;
inputting the word-level features into a sentence capsule layer of the layered capsule model, and determining the sentence-level features by combining the word-level features with a first weight acquired in advance based on the word capsule layer;
inputting the sentence-level features into a category capsule layer of the layered capsule model, and determining the category features based on the category capsule layer by combining the sentence-level features with a second weight acquired in advance;
and determining the position category corresponding to the category features, and determining the position category as an analysis result of position analysis on the text to be analyzed according to the topic phrase.
3. The method of claim 1, wherein obtaining the topic phrase corresponding to the text to be analyzed comprises:
preprocessing the text to be analyzed to obtain a set of phrases contained in the text to be analyzed;
determining a relevance score of the phrase and a preset topic and an importance score of the phrase for each phrase contained in the phrase set, and calculating the product of the relevance score and the importance score;
and determining the phrase with the largest corresponding product as the topic phrase.
4. A method according to claim 3, wherein the preprocessing the text to be analyzed to obtain a set of phrases contained in the text to be analyzed comprises:
word segmentation processing is carried out on the text to be analyzed;
based on the word segmentation result, a set of phrase compositions comprising a preset number of consecutive words is determined.
5. The method of claim 4, wherein determining a set of phrase compositions comprising a predetermined number of consecutive words based on the word segmentation result comprises:
performing word frequency statistics based on the word segmentation result;
determining, based on a word frequency statistics result, low-frequency words whose word frequency is lower than a preset threshold;
and eliminating, from the phrases comprising the preset number of consecutive words, phrases that contain the low-frequency words and/or punctuation marks, and forming the set from the remaining phrases.
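A minimal sketch of the preprocessing in claims 4 and 5, assuming word segmentation has already produced a token list: it forms every run of a preset number of consecutive words, then discards runs that contain punctuation or a word whose frequency falls below a preset threshold. Function and parameter names are illustrative only.

```python
from collections import Counter
import string

def build_phrase_set(tokens, n=2, min_freq=2):
    """Form all phrases of `n` consecutive words, then eliminate phrases containing
    punctuation marks or low-frequency words (frequency below `min_freq`)."""
    freq = Counter(tokens)                                    # word frequency statistics
    low_freq = {t for t, c in freq.items() if c < min_freq}   # low-frequency vocabulary
    punct = set(string.punctuation) | set("，。！？；：、（）")   # ASCII and fullwidth punctuation
    phrases = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if any(t in low_freq or any(ch in punct for ch in t) for t in gram):
            continue
        phrases.add(gram)
    return phrases
```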
6. The method of claim 3, wherein the determining the relevance score between the phrase and the preset topic comprises:
determining a first association degree between the preset topic and the text to be analyzed based on a pre-trained latent Dirichlet allocation (LDA) model;
determining a second association degree between the phrase and the preset topic based on a first number of occurrences of the phrase in a preset text corresponding to the preset topic and a sum of second numbers of occurrences, in the preset text, of the phrases contained in the set;
and calculating the relevance score between the phrase and the preset topic based on the first association degree and the second association degree.
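A hedged sketch of one way the claim-6 relevance score could be computed: the first association degree is read from a pre-trained LDA model's topic distribution for the text to be analyzed, and the second association degree normalises the phrase's occurrence count in a topic-specific preset text by the summed counts of all candidate phrases in that text. The argument names and the final product are assumptions; the claim only requires that the score be based on the two association degrees.

```python
def relevance_score(phrase, phrase_set, topic_id, doc_topic_dist, preset_text_counts):
    """doc_topic_dist: LDA topic distribution of the text to be analyzed; the probability
    mass on `topic_id` serves as the first association degree.
    preset_text_counts: occurrence counts of phrases in a preset text corresponding
    to the preset topic."""
    first = doc_topic_dist[topic_id]
    total = sum(preset_text_counts.get(p, 0) for p in phrase_set) or 1  # avoid division by zero
    second = preset_text_counts.get(phrase, 0) / total
    return first * second  # one plausible combination of the two association degrees
```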
7. The method of claim 3, wherein the determining the importance score of the phrase comprises:
and calculating the importance score of the phrase based on a third number of occurrences of the phrase in the text to be analyzed and a fourth number of occurrences of the phrase in a preset corpus, wherein the preset corpus comprises the text to be analyzed.
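A hedged sketch of the claim-7 importance score: a phrase matters more if it is frequent in the text to be analyzed relative to its frequency in the wider preset corpus. The exact ratio below is an assumption; the claim only states that the score is based on the two occurrence counts.

```python
def importance_score(phrase, text_counts, corpus_counts):
    """text_counts: occurrences of the phrase in the text to be analyzed (third count).
    corpus_counts: occurrences of the phrase in the preset corpus (fourth count);
    the corpus includes the text to be analyzed, so the ratio is at most 1."""
    third = text_counts.get(phrase, 0)
    fourth = corpus_counts.get(phrase, 0)
    return third / fourth if fourth else 0.0
```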
8. A text position analysis device, the device comprising:
an acquisition module, configured to acquire a text to be analyzed and a topic phrase corresponding to the text to be analyzed;
an extraction module, configured to extract features of the text to be analyzed and the topic phrase to obtain a text vector and a topic vector;
and an analysis module, configured to input the text vector and the topic vector into a pre-trained hierarchical capsule model to obtain an analysis result output by the hierarchical capsule model for position analysis of the text to be analyzed according to the topic phrase, wherein the hierarchical capsule model comprises a word capsule layer, a sentence capsule layer and a category capsule layer, which are respectively used for extracting word-level features, sentence-level features and category features of the text to be analyzed.
9. A computer device, comprising: a memory; a processor; a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program which, when executed by a processor, implements the text position analysis method according to any of claims 1-7.
CN202311492828.9A 2023-11-09 2023-11-09 Text position analysis method, device, equipment and storage medium Pending CN117574892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311492828.9A CN117574892A (en) 2023-11-09 2023-11-09 Text position analysis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311492828.9A CN117574892A (en) 2023-11-09 2023-11-09 Text position analysis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117574892A (en) 2024-02-20

Family

ID=89859854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311492828.9A Pending CN117574892A (en) 2023-11-09 2023-11-09 Text position analysis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117574892A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952083A (en) * 2024-03-26 2024-04-30 中国电子科技集团公司第三十研究所 Multi-target fine granularity standpoint analysis method based on capsule network
CN117952083B (en) * 2024-03-26 2024-07-16 中国电子科技集团公司第三十研究所 Multi-target fine granularity standpoint analysis method based on capsule network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination