CN111061838B - Text feature keyword determination method and device and storage medium - Google Patents

Publication number
CN111061838B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911313067.XA
Other languages
Chinese (zh)
Other versions
CN111061838A (en
Inventor
邓立邦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Zhimeiyuntu Tech Corp ltd
Original Assignee
Guangdong Zhimeiyuntu Tech Corp ltd
Application filed by Guangdong Zhimeiyuntu Tech Corp ltd
Priority to CN201911313067.XA
Publication of CN111061838A
Application granted
Publication of CN111061838B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method, a device and a storage medium for determining text characteristic keywords, wherein the method comprises the steps of obtaining a text to be processed, and splitting the text to be processed to obtain a plurality of sub-texts; determining text characteristics corresponding to each sub-text according to a text characteristic judgment template; counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text; and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic. According to the scheme, the text feature keywords are accurately and efficiently extracted, and subsequent further data analysis is facilitated.

Description

Text feature keyword determination method and device and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text feature keyword determining method and device and a storage medium.
Background
Text sentiment analysis, also called opinion mining or tendency analysis, is the process of analyzing, processing, summarizing and reasoning about subjective texts that carry emotional coloring. With the development of social networks, e-commerce, the mobile internet and related technologies, blogs, forums, public opinion sites and other social service networks produce a large amount of user-generated comment text about people, events, products and the like. This comment data is growing rapidly and expresses a wide range of emotional colorings and tendencies, such as happiness, anger, grief, joy, criticism and praise. Fully mining and deeply analyzing this comment data makes it possible to better understand the viewpoints and standpoints of netizens, and to better support decisions in fields such as public opinion management and control, business decision-making, viewpoint search, information prediction and emotion management.
Therefore, how to classify the emotion of a text in order to judge the main emotional tendency of a passage, and how to extract the core emotion keywords of that passage so that its emotion can be analyzed in more depth, has become a current research hotspot.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for determining text characteristic keywords, which realize accurate and efficient extraction of the text characteristic keywords and facilitate subsequent further data analysis.
In a first aspect, an embodiment of the present invention provides a method for determining text feature keywords, where the method includes:
acquiring a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
determining text characteristics corresponding to each sub-text according to a text characteristic judgment template;
counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text;
and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In a second aspect, an embodiment of the present invention further provides a device for determining text feature keywords, where the device includes:
the text splitting module is used for acquiring a text to be processed and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
the text characteristic determining module is used for determining the text characteristic corresponding to each sub text according to the text characteristic judging template;
the text character counting module is used for counting the characters of each sub-text and determining the text character with the highest occurrence frequency corresponding to the text characteristics in each sub-text;
and the keyword determining module is used for determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a storage device to store one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the text feature keyword determination method according to the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform the text feature keyword determination method according to the embodiment of the present invention.
In the embodiment of the invention, a text to be processed is acquired, and the text to be processed is split to obtain a plurality of sub-texts; determining text characteristics corresponding to each sub-text according to a text characteristic judgment template; counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text; the text characters with the highest frequency of occurrence are determined as the keywords corresponding to the text features, so that the text feature keywords are accurately and efficiently extracted, and subsequent further data analysis is facilitated.
Drawings
Fig. 1 is a flowchart of a text feature keyword determination method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 3 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 4 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 5 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 6 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 7 is a first schematic diagram of a curve coordinate graph according to an embodiment of the present invention;
Fig. 8 is a second schematic diagram of a curve coordinate graph according to an embodiment of the present invention;
Fig. 9 is a third schematic diagram of a curve coordinate graph according to an embodiment of the present invention;
Fig. 10 is a block diagram illustrating the structure of a text feature keyword determination device according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and do not limit them. It should be further noted that, for convenience of description, only some structures related to the embodiments of the present invention are shown in the drawings, not all of them.
Fig. 1 is a flowchart of a text feature keyword determination method according to an embodiment of the present invention. This embodiment is applicable to determining the keywords corresponding to the text features of a text paragraph; for example, for a piece of comment data, the emotion feature of a text paragraph and the core keyword of that emotion feature may be determined. The method may be executed by a computing device such as a server computer, and specifically includes the following steps:
step S101, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
The text to be processed is the acquired text for which text feature keywords need to be determined, such as a piece of comment text. In one embodiment, after the text to be processed is obtained, it is split to obtain a plurality of sub-texts. For example, the text to be processed may be a long passage containing several sentences, and each sentence may be determined as a sub-text.
Specifically, the text may be split as follows: the punctuation marks of the text to be processed are recognized, and the text is split according to the recognition result to obtain a plurality of sub-texts. For example, the full stops in the text to be processed can be identified and used as the basis for splitting, so that the text before and the text after each full stop become separate sub-texts.
Alternatively, the text may be split as follows: the text to be processed is divided according to a preset number of characters to obtain a plurality of sub-texts. The preset number of characters can be configured as needed, and may be, for example, 20, 30 or 50 characters.
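The two splitting modes of step S101 can be sketched in Python as follows. This is only a minimal illustration: the function names and the set of sentence-ending marks are assumptions made for the example and are not specified in the description.

```python
import re


def split_by_punctuation(text: str) -> list[str]:
    """Split after each sentence-ending mark, keeping the mark with its sentence.

    The set of marks in the character class is an assumption for illustration.
    """
    parts = re.split(r"(?<=[。．.!?！？])", text)
    return [p.strip() for p in parts if p.strip()]


def split_by_length(text: str, size: int = 20) -> list[str]:
    """Split into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


sentences = split_by_punctuation("Good film. Bad ending! Why?")
chunks = split_by_length("abcdefghij", size=4)
```

With fixed-length splitting the last sub-text may be shorter than the preset number of characters.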
Step S102, determining text characteristics corresponding to each sub-text according to the text characteristic judgment template.
The text feature judgment template is a template obtained by neural network training that can be used to judge text features. After a sub-text is obtained, the text feature corresponding to the sub-text can be determined according to the text feature judgment template. The text features include emotion features. For example, there may be four text features in total: text feature one may represent happiness, text feature two calmness, text feature three sadness, and text feature four anger.
Step S103, counting characters of each sub-text, and determining text characters with the highest frequency of occurrence corresponding to text features in each sub-text.
In one embodiment, after the sub-texts and their corresponding text features are obtained, the characters of each sub-text are counted. Illustratively, the text to be processed includes 200 characters and is divided into 10 sub-texts of 20 characters each. Each sub-text corresponds to one text feature, and different sub-texts may correspond to the same text feature. Illustratively, the relationship between each sub-text, its text feature and its most frequently occurring character is shown in the following table:
Sub-text       Text feature      Most frequent character
Sub-text 1     Text feature 4    Character a
Sub-text 2     Text feature 1    Character b
Sub-text 3     Text feature 6    Character c
Sub-text 4     Text feature 2    Character d
Sub-text 5     Text feature 4    Character e
Sub-text 6     Text feature 3    Character f
Sub-text 7     Text feature 3    Character g
Sub-text 8     Text feature 5    Character h
Sub-text 9     Text feature 6    Character i
Sub-text 10    Text feature 4    Character j
It should be noted that characters a to j are only exemplary, and the same character may occur more than once among characters a to j.
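The per-sub-text counting of step S103 can be sketched as below. The helper names are hypothetical, and ignoring whitespace when counting is an assumption of this example; the text features are passed in as plain labels because the judgment template itself is outside the scope of the sketch.

```python
from collections import Counter
from typing import List, Optional, Tuple


def most_frequent_char(sub_text: str) -> Optional[str]:
    """Return the most frequent character of a sub-text, or None if it is empty.

    Whitespace is skipped; this is an assumption of the sketch.
    """
    counts = Counter(ch for ch in sub_text if not ch.isspace())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def feature_character_table(sub_texts: List[str],
                            features: List[int]) -> List[Tuple[int, Optional[str]]]:
    """Pair each sub-text's text feature with its most frequent character."""
    return [(f, most_frequent_char(s)) for s, f in zip(sub_texts, features)]


table = feature_character_table(["banana", "moon room"], [4, 1])
```

Each row of the resulting list plays the role of one row of the table above.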
Step S104, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In one embodiment, the occurrence frequency of the text characters is counted, and once the text character with the highest occurrence frequency is determined, it is used as a keyword for the corresponding text feature. In the example table above, character a is a keyword corresponding to text feature 4, character b to text feature 1, character c to text feature 6, character d to text feature 2, character e to text feature 4, character f to text feature 3, character g to text feature 3, character h to text feature 5, character i to text feature 6, and character j to text feature 4.
According to the scheme, the text to be processed is obtained, the text to be processed is split to obtain a plurality of sub-texts, the text characteristics corresponding to each sub-text are determined according to the text characteristic judgment template, the characters of each sub-text are counted, the text character with the highest occurrence frequency corresponding to the text characteristics in each sub-text is determined, the text character with the highest occurrence frequency is determined as the keyword corresponding to the text characteristics, the text characteristics of the text and the keyword corresponding to the text characteristics can be determined quickly and efficiently, the accurate and efficient extraction of the text characteristic keywords is achieved, and the follow-up further data analysis is facilitated.
Fig. 2 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining keywords corresponding to text features. As shown in fig. 2, the technical solution is as follows:
step S201, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S202, determining text characteristics corresponding to each sub text according to the text characteristic judgment template.
Step S203, counting characters of each sub text, and determining text characters with the highest frequency of occurrence corresponding to text features in each sub text.
Step S204, determining the text character with the highest frequency of occurrence, judging whether its occurrence proportion is greater than a preset occurrence proportion, and if so, determining it as the keyword corresponding to the text characteristic.
In one embodiment, after the text character with the highest occurrence frequency is determined, it is further judged whether its occurrence proportion is greater than a preset occurrence proportion, which may be, for example, 70%. If it is greater, the text character is determined as the keyword corresponding to the text feature; otherwise, it is not determined as the keyword corresponding to the text feature.
According to the scheme, the accuracy of the determined keywords is improved by further determining the proportion of the keywords with the highest occurrence frequency, and the extraction of the text feature keywords is optimized.
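The proportion check of step S204 might look like the following sketch. The description does not state what the occurrence proportion is measured against; this example assumes it is the character's share of all non-whitespace characters in the sub-text, and the function name is hypothetical.

```python
from collections import Counter
from typing import Optional


def keyword_if_dominant(sub_text: str, min_ratio: float = 0.7) -> Optional[str]:
    """Return the most frequent character only if its share exceeds min_ratio.

    Assumption: share = count of the character / number of counted characters.
    """
    counts = Counter(ch for ch in sub_text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return None
    char, n = counts.most_common(1)[0]
    return char if n / total > min_ratio else None


strong = keyword_if_dominant("aaaab")        # share 4/5 exceeds 0.7
weak = keyword_if_dominant("banana")         # share 3/6 does not exceed 0.7
```

A sub-text whose most frequent character fails the check simply contributes no keyword.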
Fig. 3 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, and provides a specific method for determining a keyword corresponding to a text feature. As shown in fig. 3, the technical solution is as follows:
step S301, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S302, determining text characteristics corresponding to each sub text according to the text characteristic judgment template.
Step S303, counting characters of each sub-text, and determining the text character with the highest frequency of occurrence corresponding to the text feature in each sub-text.
Step S304, determining the text character with the highest frequency of occurrence, judging whether the proportion of sub-texts with the same text feature in which that character occurs is greater than a preset proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
In one embodiment, after the text character with the highest frequency of occurrence is determined, it is judged whether the proportion of sub-texts with the same text feature in which that character occurs is greater than a preset proportion. Specifically, sub-texts with the same text feature are the sub-texts for which the same text feature was determined. Taking the table listed in step S103 as an example, the text features corresponding to sub-text 1, sub-text 5 and sub-text 10 are all text feature 4, so these three are sub-texts with the same text feature. The most frequent characters determined in sub-text 1, sub-text 5 and sub-text 10 are character a, character e and character j respectively (which may be the same or different). Taking character a as an example, it is determined whether it appears in sub-text 1, sub-text 5 and sub-text 10. Character a was determined from sub-text 1 and therefore necessarily appears there; it may or may not appear in sub-text 5 and sub-text 10. Illustratively, if the proportion of these sub-texts in which character a appears exceeds the preset proportion (0.4 in this example), character a is determined as the keyword corresponding to text feature 4.
According to the scheme, it is judged whether the proportion of sub-texts with the same text feature in which the most frequent text character occurs is greater than the preset proportion, and if so, that character is determined as the keyword corresponding to the text feature. This further improves the accuracy of the extracted keyword and strengthens its ability to represent the text feature.
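The same-feature check of step S304 can be illustrated with a short sketch. The function names are hypothetical, the proportion is assumed to be the fraction of same-feature sub-texts that contain the character (including the sub-text it came from), and the text features are supplied as precomputed labels.

```python
from collections import Counter
from typing import Dict, List, Optional


def top_char(text: str) -> Optional[str]:
    """Most frequent non-whitespace character, or None for an empty text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return counts.most_common(1)[0][0] if counts else None


def keywords_by_feature(sub_texts: List[str], features: List[int],
                        min_share: float = 0.4) -> Dict[int, List[str]]:
    """Keep a sub-text's top character only if it occurs in more than
    min_share of the sub-texts that share the same text feature."""
    result: Dict[int, List[str]] = {}
    for text, feature in zip(sub_texts, features):
        char = top_char(text)
        if char is None:
            continue
        peers = [t for t, f in zip(sub_texts, features) if f == feature]
        share = sum(char in t for t in peers) / len(peers)
        if share > min_share:
            kept = result.setdefault(feature, [])
            if char not in kept:
                kept.append(char)
    return result


found = keywords_by_feature(["aab", "cca", "ddd", "xxy"], [4, 4, 4, 1])
```

In this toy input, "a" occurs in two of the three sub-texts with feature 4 and is kept, while "c" and "d" each occur in only one of them and are dropped.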
Fig. 4 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining keywords corresponding to text features. As shown in fig. 4, the technical solution is as follows:
step S401, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S402, determining text characteristics corresponding to each sub-text according to the text characteristic judgment template.
Step S403, counting the characters of each sub-text, and determining the text character with the highest frequency of occurrence corresponding to the text feature in each sub-text.
Step S404, determining the text character with the highest frequency of occurrence, judging whether the proportion of all the sub-texts in which that character occurs is greater than a preset proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
In one embodiment, after the text character with the highest occurrence frequency is determined, it is judged whether the proportion of all sub-texts in which that character occurs is greater than a preset proportion. Specifically, assume there are 10 sub-texts, namely sub-text 1 to sub-text 10. Assuming that the character with the highest frequency of occurrence in sub-text 5 is character e, and that the text feature determined for sub-text 5 is text feature 4, it is determined whether character e also exists in sub-text 1, sub-text 2, sub-text 3, sub-text 4, sub-text 6, sub-text 7, sub-text 8, sub-text 9 and sub-text 10. If the occurrence proportion of character e is greater than the preset proportion (for example, 0.7), character e is determined as the keyword corresponding to text feature 4.
According to the scheme, whether the text character with the highest occurrence frequency is larger than the preset proportion of the text character in all the sub-texts is judged by determining the text character with the highest occurrence frequency, if so, the text character with the highest occurrence frequency is determined as the keyword corresponding to the text characteristic, the whole context of the text to be processed is reasonably considered, the text characteristic keyword is accurately and efficiently extracted, and the subsequent further data analysis is facilitated.
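A minimal sketch of the check across all sub-texts in step S404 follows. It assumes the occurrence proportion is the fraction of all sub-texts containing the character, counting the sub-text in which the character was found; the function names are hypothetical.

```python
from typing import List


def share_of_sub_texts(char: str, sub_texts: List[str]) -> float:
    """Fraction of sub-texts in which `char` occurs at least once."""
    if not sub_texts:
        return 0.0
    return sum(char in t for t in sub_texts) / len(sub_texts)


def is_global_keyword(char: str, sub_texts: List[str],
                      min_share: float = 0.7) -> bool:
    """True if `char` occurs in more than min_share of all sub-texts."""
    return share_of_sub_texts(char, sub_texts) > min_share


subs = ["eel", "tree", "bee", "cat"]
e_ok = is_global_keyword("e", subs)   # "e" occurs in 3 of 4 sub-texts
c_ok = is_global_keyword("c", subs)   # "c" occurs in 1 of 4 sub-texts
```

Because the check looks at every sub-text regardless of its text feature, it reflects the overall context of the text to be processed rather than a single feature group.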
Fig. 5 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining a text feature determination template. As shown in fig. 5, the technical solution is as follows:
step S501, distributing identification information for the identification characters, and generating character identification association data.
The identification characters are characters used for comment or record of the user, and may be at least one of Chinese characters, punctuation characters, numeric characters and English characters, and the identification characters are not limited to the listed characters and may include any other contents capable of being entered and displayed.
The identification information is data that plays a role of identification, each identification character is assigned with unique corresponding identification information, illustratively, serial numbers 1 to 10000, each serial number corresponds to an identification character, for example, identification characters "day", "ground", and "person" correspond to identification information 1, 2, and 3, respectively.
The character identification association data can be stored in a manner of associating fields in a mapping table or a database, that is, the identification characters are associated with the allocated identification information, and after the identification characters are determined, the identification information corresponding to the identification characters can be uniquely determined.
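The character identification association data of step S501 can be pictured as a simple mapping. In the sketch below, serial numbers are assigned in first-seen order starting from 1; that ordering rule and the function name are assumptions, since the description only requires that each identification character receive unique identification information.

```python
from typing import Dict, Iterable


def build_char_ids(characters: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct identification character a unique serial number."""
    ids: Dict[str, int] = {}
    for ch in characters:
        if ch not in ids:
            ids[ch] = len(ids) + 1
    return ids


char_ids = build_char_ids(["day", "ground", "person", "day"])
```

Looking up a character in the mapping then uniquely determines its identification information, as required above.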
Step S502, a text training set is obtained, a curve coordinate graph corresponding to each training text is generated according to the character identification association data, the curve coordinate graph is used as input, corresponding training text characteristics are used as output, and a text characteristic judgment template is obtained by training through a neural network.
The text training set is a combination of texts for learning and training, the text training set includes a plurality of training texts, each training text includes a plurality of identification characters, and illustratively, the training texts may be comment data obtained through a network, such as comment data for a news, an entertainment event, or a movie and television play.
As described above, each training text is composed of a plurality of identification characters, and each identification character and corresponding identification information are recorded in the character identification association data. In one embodiment, a curve coordinate graph corresponding to each training text is generated according to the character identification association data, wherein the curve coordinate graph is marked with the occurrence frequency of each character in the training text and the corresponding identification information.
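The data behind such a curve coordinate graph can be sketched as a list of points whose abscissa is the identification information and whose ordinate is the occurrence count. The function name is hypothetical, and skipping characters that have no identification information is an assumption of this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def curve_points(text: str, char_ids: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return (identification information, occurrence count) pairs sorted by ID.

    Characters missing from char_ids are ignored (an assumption of the sketch).
    """
    counts = Counter(ch for ch in text if ch in char_ids)
    return sorted((char_ids[ch], n) for ch, n in counts.items())


ids = {"a": 1, "b": 2, "c": 3}
points = curve_points("abacab", ids)
```

Connecting the points in order of identification information gives the curve for one training text.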
In the training process, text features can be set; for example, ten text feature grades are set, and each training text is matched with its unique corresponding text feature grade. The curve coordinate graph corresponding to each training text is used as input and the corresponding text feature grade as output, and neural network training is performed to obtain the text feature judgment template. The training can be implemented with an existing mature neural network, such as a convolutional neural network (CNN). The trained text feature judgment template records standard characteristic curve coordinate graphs and their corresponding text features. Illustratively, the template records: characteristic curve 1 corresponds to text feature 1; characteristic curve 2 corresponds to text feature 2; characteristic curve 3 corresponds to text feature 3. It should be noted that this record is only illustrative. A large number of characteristic curve coordinate graphs and corresponding text features are recorded in the template, and each text feature may correspond to several different characteristic curves; for example, characteristic curve 100, characteristic curve 125 and characteristic curve 306 may all correspond to text feature 5.
Step S503, a text to be processed is obtained, and the text to be processed is subjected to text splitting to obtain a plurality of sub-texts.
Step S504, determining the text characteristics corresponding to each sub-text according to the text characteristic judgment template.
Step S505, counting the characters of each sub-text, and determining the text character with the highest frequency of occurrence corresponding to the text features in each sub-text.
Step S506, determining the text character with the highest occurrence frequency as the keyword corresponding to the text feature.
According to the scheme, identification information is distributed to the identification characters, character identification association data is generated, a text training set is obtained, the text set comprises a plurality of training texts and training text characteristics corresponding to each training text, each curve coordinate graph corresponding to each training text is generated according to the character identification association data, the curve coordinate graphs serve as input, the corresponding training text characteristics serve as output, a neural network is used for training to obtain a text characteristic judgment template, the text characteristics can be judged efficiently and accurately, and the efficiency is higher than that of a manual mode or other intelligent modes.
Fig. 6 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining the text feature corresponding to each sub-text. As shown in fig. 6, the technical solution is as follows:
step S601, distributing identification information for the identification characters, and generating character identification association data.
Step S602, a text training set is obtained, a curve coordinate graph corresponding to each training text is generated according to the character identification association data, the curve coordinate graph is used as input, corresponding training text characteristics are used as output, and a text characteristic judgment template is obtained by training through a neural network.
Step S603, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S604, determining the identification characters in each sub-text, and generating a corresponding curve coordinate graph according to the character identification association data.
In one embodiment, each sub-text is processed to obtain a corresponding curve coordinate graph. Specifically, the identification characters in each sub-text are identified, and the identification information corresponding to each identification character is determined according to the character identification association data to generate the curve coordinate graph. The abscissa of the curve coordinate graph is the identification information, and the ordinate is the number of occurrences of each identification character. Exemplarily, as shown in fig. 7, which is a first schematic diagram of a curve coordinate graph according to an embodiment of the present invention, identification information 1 to 6000 is recorded on the abscissa, each piece of identification information corresponds to one identification character, and the ordinate is marked with occurrence counts from 1 to 30.
Step S605, comparing the curve coordinate graph corresponding to each sub-text with the characteristic curve coordinate graph in the text characteristic judgment template, and determining the text characteristic of each sub-text according to the comparison result.
In one embodiment, one or more text feature judgment templates are stored for different text features, and each text feature judgment template can be embodied as a specific curve coordinate graph. The curve coordinate graph corresponding to each sub-text is compared with the characteristic curve coordinate graphs in the text feature judgment templates, and the text feature of each sub-text is determined according to the comparison result.
For example, Fig. 8 is a second schematic diagram of a curve coordinate graph provided in an embodiment of the present invention. As shown in Fig. 8, it shows the curve coordinate graphs of the text feature judgment templates corresponding to four text features, where the first text feature can represent happiness, the second text feature can represent calmness, the third text feature can represent sadness, and the fourth text feature can represent anger.
Fig. 9 is a third schematic diagram of a curve coordinate graph according to an embodiment of the present invention. As shown in Fig. 9, the graph is, by way of example, the curve coordinate graph corresponding to one sub-text. It is compared with the curve coordinate graphs in Fig. 8 to find the one with the highest similarity, which is taken as the curve coordinate graph matching the sub-text, and the text feature corresponding to the matched curve coordinate graph is determined as the text feature of the sub-text. For example, when the matched text feature is the one representing sadness, the corresponding sub-text is characterized as expressing a pessimistic emotion.
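By way of illustration only, the comparison step can be sketched as follows. The embodiment does not state which similarity measure is used; cosine similarity over the occurrence counts is assumed here, and the four template vectors are invented values rather than data from Fig. 8.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_text_feature(curve, templates):
    """Return (text feature, similarity) of the template curve that is most
    similar to the sub-text curve."""
    scored = ((name, cosine_similarity(curve, t)) for name, t in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical characteristic curves: one occurrence count per identification number.
templates = {
    "happy": [5, 0, 1, 0],
    "calm": [1, 1, 1, 1],
    "sad": [0, 4, 0, 2],
    "angry": [0, 0, 1, 5],
}
feature, score = match_text_feature([0, 3, 0, 1], templates)
print(feature)  # sad
```

Any other curve similarity measure could be substituted for `cosine_similarity` without changing the surrounding logic.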
Step S606, counting characters of each sub-text, and determining text characters with the highest frequency of occurrence corresponding to text features in each sub-text.
Step S607, the text character with the highest frequency of occurrence is determined as the keyword corresponding to the text feature.
According to the scheme, the curve coordinate graph corresponding to each sub-text is matched against the characteristic curve coordinate graphs in the text feature judgment template, and the text feature corresponding to the sub-text is output according to the matching result. Text features can thus be judged rapidly, which better assists decisions in fields such as public opinion management and control, business decision-making, viewpoint search, information prediction and emotion management, and the determined keywords have practical guiding significance.
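By way of illustration only, steps S606 and S607 can be sketched as follows. The sketch assumes that sub-texts are grouped by their determined text feature before counting, that a "text character" is a single character, and that whitespace is not counted; the function name `keywords_by_feature` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def keywords_by_feature(sub_texts, features):
    """Count characters over the sub-texts that share a text feature and
    return, for each text feature, the most frequent character."""
    counts = defaultdict(Counter)
    for text, feature in zip(sub_texts, features):
        counts[feature].update(ch for ch in text if not ch.isspace())
    return {f: c.most_common(1)[0][0] for f, c in counts.items() if c}


# Hypothetical sub-texts with the text features already determined for them.
sub_texts = ["aab", "abb b", "cc d"]
features = ["happy", "happy", "sad"]
keywords = keywords_by_feature(sub_texts, features)
print(keywords)  # {'happy': 'b', 'sad': 'c'}
```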
Fig. 10 is a block diagram of a text feature keyword determining apparatus according to an embodiment of the present invention, where the apparatus is configured to execute the text feature keyword determining method according to the foregoing embodiment, and has corresponding functional modules and beneficial effects of the executing method. As shown in fig. 10, the apparatus specifically includes: a text splitting module 101, a text feature determination module 102, a text character statistics module 103, and a keyword determination module 104, wherein,
the text splitting module 101 is configured to obtain a text to be processed, and perform text splitting on the text to be processed to obtain a plurality of sub-texts;
the text feature determining module 102 is configured to determine a text feature corresponding to each sub-text according to a text feature judgment template;
the text character counting module 103 is configured to count characters of each sub-text, and determine a text character with the highest occurrence frequency corresponding to a text feature in each sub-text;
and the keyword determining module 104 is configured to determine the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
According to the scheme, a text to be processed is obtained and split into a plurality of sub-texts; the text feature corresponding to each sub-text is determined according to the text feature judgment template; the characters of each sub-text are counted to determine the text character with the highest occurrence frequency corresponding to each text feature; and that text character is determined as the keyword corresponding to the text feature. The text features of a text and the keywords corresponding to them can thus be determined quickly and efficiently, so that text feature keywords are extracted accurately and efficiently, which facilitates subsequent data analysis.
In a possible embodiment, the text splitting module 101 is specifically configured to:
recognizing punctuation marks of the text to be processed, and splitting the text to be processed according to a recognition result to obtain a plurality of sub-texts; or
dividing the text to be processed according to a preset number of characters to obtain a plurality of sub-texts.
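By way of illustration only, the two splitting options can be sketched as follows. The set of punctuation marks treated as split points (`SPLIT_MARKS`) and the function names are assumptions of this sketch; the embodiment does not list which punctuation marks are recognized.

```python
import re

# Assumed sentence-ending marks (Chinese and English); not specified by the source.
SPLIT_MARKS = "。！？；!?.;"


def split_by_punctuation(text, marks=SPLIT_MARKS):
    """Split the text at any of the given punctuation marks and drop empty pieces."""
    parts = re.split("[" + re.escape(marks) + "]", text)
    return [p.strip() for p in parts if p.strip()]


def split_by_length(text, size):
    """Split the text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


by_punct = split_by_punctuation("I am happy. Are you? Yes!")
by_length = split_by_length("abcdefg", 3)
print(by_punct)   # ['I am happy', 'Are you', 'Yes']
print(by_length)  # ['abc', 'def', 'g']
```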
In a possible embodiment, the determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature includes:
determining the text character with the highest occurrence frequency, judging whether the occurrence proportion of that text character is greater than a preset occurrence proportion, and if so, determining the text character with the highest occurrence frequency as the keyword corresponding to the text feature.
In a possible embodiment, the judging whether the occurrence proportion of the text character with the highest occurrence frequency is greater than a preset occurrence proportion includes:
judging whether the proportion in which the text character with the highest occurrence frequency appears in the sub-texts having the same text feature is greater than a preset proportion; or judging whether the proportion in which the text character with the highest occurrence frequency appears in all the sub-texts is greater than a preset proportion.
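By way of illustration only, the proportion check can be sketched as follows. The sketch assumes that the occurrence proportion is the number of occurrences of the candidate character divided by the total number of non-whitespace characters in the chosen scope (the sub-texts of the same text feature, or all sub-texts); the function names and the threshold value 0.5 are hypothetical.

```python
def occurrence_ratio(char, texts):
    """Share of non-whitespace characters in `texts` that equal `char`."""
    total = sum(1 for t in texts for ch in t if not ch.isspace())
    if total == 0:
        return 0.0
    hits = sum(t.count(char) for t in texts)
    return hits / total


def accept_keyword(char, texts, preset_ratio):
    """Accept the candidate only if its occurrence proportion exceeds the preset one."""
    return occurrence_ratio(char, texts) > preset_ratio


# Hypothetical data: "a" is the most frequent character in the same-feature group.
same_feature = ["aab", "ab"]
all_sub_texts = same_feature + ["cccdd"]
ok_in_feature = accept_keyword("a", same_feature, 0.5)  # proportion 3/5
ok_overall = accept_keyword("a", all_sub_texts, 0.5)    # proportion 3/10
print(ok_in_feature, ok_overall)  # True False
```

The same candidate can therefore pass under one scope and fail under the other, which is why the two alternatives are distinguished.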
In one possible embodiment, the apparatus further comprises a training module 105 for, before obtaining the text to be processed:
allocating identification information to the identification characters to generate character identification association data, wherein the identification characters comprise at least one of Chinese characters, punctuation characters, numeric characters and English characters;
acquiring a text training set, wherein the text training set comprises a plurality of training texts and training text characteristics corresponding to each training text, and each training text is composed of one or more identification characters;
generating a curve coordinate graph corresponding to each training text according to the character identification association data;
and taking the curve coordinate graph as input, taking the corresponding training text characteristic as output, and training by utilizing a neural network to obtain a text characteristic judgment template.
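By way of illustration only, the training flow above can be sketched as follows. The embodiment does not specify the network architecture; a single-layer softmax network trained by stochastic gradient descent is assumed here as a stand-in, and the character set, training texts, labels, learning rate and epoch count are all invented for the sketch.

```python
import math


def build_char_ids(chars):
    """Map each distinct identification character to a vector index (0-based here)."""
    return {ch: i for i, ch in enumerate(dict.fromkeys(chars))}


def to_curve(text, char_ids):
    """Occurrence count per identification number: the curve as a fixed-length vector."""
    vec = [0.0] * len(char_ids)
    for ch in text:
        if ch in char_ids:
            vec[char_ids[ch]] += 1.0
    return vec


def scores(x, weights, biases):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def train(curves, labels, epochs=200, lr=0.1):
    """Train a single-layer softmax network: curve in, text feature out."""
    classes = sorted(set(labels))
    weights = [[0.0] * len(curves[0]) for _ in classes]
    biases = [0.0] * len(classes)
    for _ in range(epochs):
        for x, label in zip(curves, labels):
            probs = softmax(scores(x, weights, biases))
            for k, cls in enumerate(classes):
                grad = probs[k] - (1.0 if cls == label else 0.0)
                biases[k] -= lr * grad
                for j, xj in enumerate(x):
                    weights[k][j] -= lr * grad * xj
    return classes, weights, biases


def predict(x, model):
    classes, weights, biases = model
    s = scores(x, weights, biases)
    return classes[s.index(max(s))]


# Hypothetical training set: texts with their training text features.
char_ids = build_char_ids("abc")
train_texts = ["aaab", "aab", "bccc", "cccb"]
train_labels = ["happy", "happy", "sad", "sad"]
model = train([to_curve(t, char_ids) for t in train_texts], train_labels)
print(predict(to_curve("aaaa", char_ids), model))  # happy
print(predict(to_curve("cc", char_ids), model))    # sad
```

The trained weights play the role of the text feature judgment template in this sketch; a deeper network could be substituted without changing the input and output described above.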
In a possible embodiment, the training module 105 is specifically configured to:
determining each training identification character contained in the training text;
determining identification information corresponding to each training identification character according to the character identification association data;
and counting the identification information of each training identification character to generate a curve coordinate graph corresponding to the training text, wherein the abscissa of the curve coordinate graph is the identification information, and the ordinate is the occurrence frequency of each training identification character.
In a possible embodiment, the text feature determination module 102 is specifically configured to:
determining an identification character in each sub-text, and generating a corresponding curve coordinate graph according to the character identification association data;
comparing the curve coordinate graph corresponding to each sub-text with the characteristic curve coordinate graph in the text characteristic judgment template;
and determining the text characteristics of each sub-text according to the comparison result.
Fig. 11 is a schematic structural diagram of an apparatus according to an embodiment of the present invention, as shown in fig. 11, the apparatus includes a processor 201, a memory 202, an input device 203, and an output device 204; the number of the processors 201 in the device may be one or more, and one processor 201 is taken as an example in fig. 11; the processor 201, the memory 202, the input device 203 and the output device 204 in the apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 11.
The memory 202 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text feature keyword determination method in the embodiment of the present invention. The processor 201 executes various functional applications and data processing of the device by running software programs, instructions and modules stored in the memory 202, that is, implements the text feature keyword determination method described above.
The memory 202 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 202 may further include memory located remotely from the processor 201, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 203 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 204 may include a display device such as a display screen.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for text feature keyword determination, the method including:
acquiring a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
determining text characteristics corresponding to each sub text according to a text characteristic judgment template;
counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text;
and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
From the above description of the embodiments, it is obvious for those skilled in the art that the embodiments of the present invention can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better implementation in many cases. Based on such understanding, the technical solutions of the embodiments of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.
It should be noted that, in the embodiment of the text feature keyword determination apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.
It should be noted that the foregoing is only a preferred embodiment of the present invention and the technical principles applied. Those skilled in the art will appreciate that the embodiments of the present invention are not limited to the specific embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the embodiments of the invention. Therefore, although the embodiments of the present invention have been described in more detail through the above embodiments, the embodiments of the present invention are not limited to the above embodiments, and many other equivalent embodiments may be included without departing from the concept of the embodiments of the present invention, and the scope of the embodiments of the present invention is determined by the scope of the appended claims.

Claims (7)

1. The text characteristic keyword determining method is characterized by comprising the following steps:
distributing identification information for the identification characters to generate character identification associated data;
acquiring a text training set, wherein the text training set comprises a plurality of training texts and training text characteristics corresponding to each training text, and each training text is composed of one or more identification characters;
determining each training identification character contained in the training text, determining identification information corresponding to each training identification character according to the character identification association data, counting the identification information of each training identification character, and generating a curve coordinate graph corresponding to the training text;
taking the curve coordinate graph as input, taking corresponding training text characteristics as output, and training by utilizing a neural network to obtain a text characteristic judgment template;
acquiring a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
determining identification characters in each sub-text, generating a corresponding curve coordinate graph according to the character identification association data, comparing the curve coordinate graph corresponding to each sub-text with the characteristic curve coordinate graph in the text characteristic judgment template, and determining the text characteristic of each sub-text according to the comparison result;
counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text;
and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
2. The method according to claim 1, wherein the obtaining the text to be processed and performing text splitting on the text to be processed to obtain a plurality of sub-texts comprises:
recognizing punctuation marks of the text to be processed, and splitting the text to be processed according to a recognition result to obtain a plurality of sub-texts; or
dividing the text to be processed according to a preset number of characters to obtain a plurality of sub-texts.
3. The method according to claim 1 or 2, wherein the determining the text character with the highest occurrence frequency as the keyword corresponding to the text feature comprises:
determining the text character with the highest frequency of occurrence, judging whether the text character with the highest frequency of occurrence is larger than a preset occurrence proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
4. The method of claim 3, wherein the determining whether the text character with the highest frequency of occurrence is greater than a preset occurrence ratio comprises:
judging whether the proportion in which the text character with the highest occurrence frequency appears in the sub-texts having the same text feature is greater than a preset proportion; or judging whether the proportion in which the text character with the highest occurrence frequency appears in all the sub-texts is greater than a preset proportion.
5. The text feature keyword determination device is characterized by comprising:
the training module is used for distributing identification information to the identification characters, generating character identification association data and acquiring a text training set, wherein the text training set comprises a plurality of training texts and training text characteristics corresponding to each training text, each training text consists of one or more identification characters, each training identification character contained in each training text is determined, the identification information corresponding to each training identification character is determined according to the character identification association data, the identification information of each training identification character is counted, a curve coordinate graph corresponding to each training text is generated, the curve coordinate graph is used as input, the corresponding training text characteristics are used as output, and a neural network is used for training to obtain a text characteristic judgment template;
the text splitting module is used for acquiring a text to be processed and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
the text feature determination module is used for determining a text feature corresponding to each sub-text according to a text feature judgment template, specifically, for determining an identification character in each sub-text, generating a corresponding curve coordinate graph according to the character identification association data, comparing the curve coordinate graph corresponding to each sub-text with the feature curve coordinate graph in the text feature judgment template, and determining the text feature of each sub-text according to a comparison result;
the text character counting module is used for counting the characters of each sub-text and determining the text character with the highest occurrence frequency corresponding to the text characteristics in each sub-text;
and the keyword determining module is used for determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
6. A text feature keyword determination apparatus, the text feature keyword determination apparatus comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text feature keyword determination method of any one of claims 1-4.
7. A storage medium containing computer executable instructions for performing the text feature keyword determination method of any one of claims 1-4 when executed by a computer processor.
CN201911313067.XA 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium Active CN111061838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911313067.XA CN111061838B (en) 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911313067.XA CN111061838B (en) 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium

Publications (2)

Publication Number Publication Date
CN111061838A CN111061838A (en) 2020-04-24
CN111061838B true CN111061838B (en) 2023-04-07

Family

ID=70302410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911313067.XA Active CN111061838B (en) 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium

Country Status (1)

Country Link
CN (1) CN111061838B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010072A (en) * 2021-04-27 2021-06-22 维沃移动通信(杭州)有限公司 Searching method and device, electronic equipment and readable storage medium
CN113486184B (en) * 2021-09-07 2022-01-21 北京达佳互联信息技术有限公司 Keyword determination method, device, equipment and storage medium
CN116629254B (en) * 2023-05-05 2024-03-22 杭州正策信息科技有限公司 Policy text analysis method based on text analysis and recognition

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168952A (en) * 2017-05-15 2017-09-15 北京百度网讯科技有限公司 Information generating method and device based on artificial intelligence
CN108875067A (en) * 2018-06-29 2018-11-23 北京百度网讯科技有限公司 text data classification method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8136034B2 (en) * 2007-12-18 2012-03-13 Aaron Stanton System and method for analyzing and categorizing text


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孟凡博 ; 蔡莲红 ; 陈斌 ; 吴鹏 ; .文本褒贬倾向判定系统的研究.小型微型计算机系统.2009,(第07期),第212-215页. *

Also Published As

Publication number Publication date
CN111061838A (en) 2020-04-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant