CN111061838A - Text feature keyword determination method and device and storage medium

Info

Publication number
CN111061838A
Authority
CN
China
Prior art keywords
text
sub
character
determining
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911313067.XA
Other languages
Chinese (zh)
Other versions
CN111061838B (en)
Inventor
邓立邦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Intellvision Technology Co ltd
Original Assignee
Guangdong Intellvision Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Intellvision Technology Co ltd filed Critical Guangdong Intellvision Technology Co ltd
Priority to CN201911313067.XA priority Critical patent/CN111061838B/en
Publication of CN111061838A publication Critical patent/CN111061838A/en
Application granted granted Critical
Publication of CN111061838B publication Critical patent/CN111061838B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The embodiment of the invention discloses a method, a device and a storage medium for determining text characteristic keywords, wherein the method comprises the steps of obtaining a text to be processed, and splitting the text to be processed to obtain a plurality of sub-texts; determining text characteristics corresponding to each sub-text according to a text characteristic judgment template; counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text; and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic. According to the scheme, the text feature keywords are accurately and efficiently extracted, and subsequent further data analysis is facilitated.

Description

Text feature keyword determination method and device and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text feature keyword determining method and device and a storage medium.
Background
Text sentiment analysis, also called opinion mining or tendency analysis, is the process of analyzing, processing, summarizing, and reasoning about subjective texts that carry emotional coloring. With the development of social networks, e-commerce, the mobile internet, and related technologies, blogs, forums, and social service networks such as public review sites generate a large amount of user-contributed text comment data about people, events, products, and the like. This data is expanding rapidly and expresses people's various emotional colorings and tendencies, such as joy, anger, sorrow, happiness, criticism, and praise. Fully mining and deeply analyzing this text comment data helps to better understand the viewpoints and positions of internet users, and better supports decisions in fields such as public opinion management and control, business decision-making, viewpoint search, information prediction, and emotion management.
Therefore, how to classify the emotion of a text so as to judge the main emotional tendency of a passage and extract its core emotion keywords, thereby facilitating further in-depth analysis of the text's emotion, has become a current research hotspot.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for determining text characteristic keywords, which realize accurate and efficient extraction of the text characteristic keywords and facilitate subsequent further data analysis.
In a first aspect, an embodiment of the present invention provides a method for determining text feature keywords, where the method includes:
acquiring a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
determining text characteristics corresponding to each sub-text according to a text characteristic judgment template;
counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text;
and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In a second aspect, an embodiment of the present invention further provides a device for determining text feature keywords, where the device includes:
the text splitting module is used for acquiring a text to be processed and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
the text characteristic determining module is used for determining the text characteristic corresponding to each sub-text according to the text characteristic judging template;
the text character counting module is used for counting the characters of each sub-text and determining the text character with the highest occurrence frequency corresponding to the text characteristics in each sub-text;
and the keyword determining module is used for determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the text feature keyword determination method according to the embodiment of the present invention.
In a fourth aspect, the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the text feature keyword determination method according to the present invention.
In the embodiment of the invention, a text to be processed is acquired, and the text to be processed is split to obtain a plurality of sub-texts; determining text characteristics corresponding to each sub-text according to a text characteristic judgment template; counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text; the text characters with the highest frequency of occurrence are determined as the keywords corresponding to the text features, so that the text feature keywords are accurately and efficiently extracted, and subsequent further data analysis is facilitated.
Drawings
Fig. 1 is a flowchart of a text feature keyword determination method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 3 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 4 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 5 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 6 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention;
Fig. 7 is a first schematic diagram of a curve coordinate graph according to an embodiment of the present invention;
Fig. 8 is a second schematic diagram of a curve coordinate graph according to an embodiment of the present invention;
Fig. 9 is a third schematic diagram of a curve coordinate graph according to an embodiment of the present invention;
Fig. 10 is a block diagram illustrating a structure of a text feature keyword determining apparatus according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad invention. It should be further noted that, for convenience of description, only some structures, not all structures, relating to the embodiments of the present invention are shown in the drawings.
Fig. 1 is a flowchart of a text feature keyword determining method according to an embodiment of the present invention. This embodiment is applicable to determining the keywords corresponding to the text features of a text paragraph; for example, for a piece of comment data, the emotion feature of a text paragraph and the core keywords of that emotion feature may be determined. The method may be executed by a computing device such as a server computer, and specifically includes the following steps:
Step S101, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
The text to be processed is the acquired text which needs to be subjected to text characteristic keyword determination, such as a section of comment text. In one embodiment, after the text to be processed is acquired, the text to be processed is split to obtain a plurality of sub-texts. For example, a piece of text to be processed may be a large piece of speech, which includes sentences, and each sentence may be determined as a sub-text.
Specifically, the text may be split as follows: punctuation marks in the text to be processed are recognized, and the text is split according to the recognition result to obtain a plurality of sub-texts. For instance, the periods in the text to be processed can be identified and used as the basis for splitting, so that the text before and after each period is separated.
Alternatively, the text to be processed may be divided according to a preset number of characters to obtain a plurality of sub-texts. The preset number of characters is configurable and may be, for example, 20, 30, or 50 characters.
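The two splitting modes of step S101 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `split_by_period` and `split_by_length` are hypothetical, and treating both the Chinese full stop and the Western period as split points is an assumption.

```python
import re

def split_by_period(text):
    """Split text into sub-texts at periods (assumed: Chinese '。' or Western '.')."""
    parts = re.split(r"[。.]", text)
    # Drop empty fragments, e.g. the one after a trailing period.
    return [p.strip() for p in parts if p.strip()]

def split_by_length(text, size=20):
    """Split text into consecutive sub-texts of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]
```

With a 200-character input and `size=20`, `split_by_length` yields the 10 sub-texts of 20 characters used in the later examples.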
Step S102, determining text characteristics corresponding to each sub-text according to the text characteristic judgment template.
The text feature judgment template is a template obtained through neural network training that can be used to judge text features. After a sub-text is obtained, the text feature corresponding to it can be determined according to the text feature judgment template. The text features include emotion features. For example, there may be four text features in total, namely text feature one, text feature two, text feature three, and text feature four, where text feature one can represent happiness, text feature two calmness, text feature three sadness, and text feature four anger.
Step S103, counting characters of each sub-text, and determining text characters with the highest frequency of occurrence corresponding to text features in each sub-text.
In one embodiment, after the sub-texts and the text features corresponding to the sub-texts are obtained, the characters of each sub-text are counted. Illustratively, the text to be processed includes 200 characters, which are divided into 10 sub-texts, each sub-text is assumed to include 20 characters, and each sub-text corresponds to a text feature, and different sub-texts may correspond to the same text feature. Illustratively, the relationship between the sub-text and the corresponding text features and the characters with the highest frequency of occurrence of the sub-text is shown in the following table:
Sub-text       Text feature      Highest-frequency character
Sub-text 1     Text feature 4    Character a
Sub-text 2     Text feature 1    Character b
Sub-text 3     Text feature 6    Character c
Sub-text 4     Text feature 2    Character d
Sub-text 5     Text feature 4    Character e
Sub-text 6     Text feature 3    Character f
Sub-text 7     Text feature 3    Character g
Sub-text 8     Text feature 5    Character h
Sub-text 9     Text feature 6    Character i
Sub-text 10    Text feature 4    Character j
It should be noted that the characters a to j above are only exemplary, and the characters a to j may include repeated characters.
Step S104, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In one embodiment, the occurrence frequency of text characters is counted, and after a text character with the highest occurrence frequency is determined, the text character is used as a keyword corresponding to a corresponding text feature, as shown in the above example list, character a is a keyword corresponding to a text feature 4, character b is a keyword corresponding to a text feature 1, character c is a keyword corresponding to a text feature 6, character d is a keyword corresponding to a text feature 2, character e is a keyword corresponding to a text feature 4, character f is a keyword corresponding to a text feature 3, character g is a keyword corresponding to a text feature 3, character h is a keyword corresponding to a text feature 5, character i is a keyword corresponding to a text feature 6, and character j is a keyword corresponding to a text feature 4.
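Steps S103 and S104 can be illustrated with a short counting sketch. The helper names are hypothetical, the text features are assumed to have been determined already in step S102, and ignoring whitespace when counting is an assumption the source does not state.

```python
from collections import Counter

def most_frequent_char(sub_text):
    """Return the character that occurs most often in a sub-text, or None if empty."""
    # Assumption: whitespace is not counted as a candidate character.
    counts = Counter(ch for ch in sub_text if not ch.isspace())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def keywords_by_sub_text(sub_texts, features):
    """Pair each sub-text's text feature with its highest-frequency character."""
    return [(feature, most_frequent_char(s)) for s, feature in zip(sub_texts, features)]
```

Each returned pair corresponds to one row of the table above: a text feature and the character taken as its keyword.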
According to the scheme, the text to be processed is obtained, the text to be processed is split to obtain a plurality of sub-texts, the text characteristics corresponding to each sub-text are determined according to the text characteristic judgment template, the characters of each sub-text are counted, the text character with the highest occurrence frequency corresponding to the text characteristics in each sub-text is determined, the text character with the highest occurrence frequency is determined as the keyword corresponding to the text characteristics, the text characteristics of the text and the keyword corresponding to the text characteristics can be determined quickly and efficiently, the accurate and efficient extraction of the text characteristic keywords is achieved, and the follow-up further data analysis is facilitated.
Fig. 2 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining keywords corresponding to text features. As shown in fig. 2, the technical solution is as follows:
Step S201, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S202, determining text characteristics corresponding to each sub text according to the text characteristic judgment template.
Step S203, counting characters of each sub-text, and determining text characters with the highest frequency of occurrence corresponding to text features in each sub-text.
Step S204, determining the text character with the highest frequency of occurrence, judging whether the occurrence proportion of that text character is greater than a preset occurrence proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
In one embodiment, after the text character with the highest occurrence frequency is determined, it is further judged whether the occurrence proportion of that character is greater than a preset occurrence proportion, which may be, for example, 70%. If it is greater than the preset occurrence proportion, the character is determined as the keyword corresponding to the text feature; otherwise, it is not determined as the keyword corresponding to the text feature.
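The threshold check of step S204 might look like the sketch below. The source does not define how the occurrence proportion is computed; here it is assumed to be the character's count divided by the total number of characters in the sub-text. The name `keyword_if_dominant` and the parameter `min_ratio` are hypothetical.

```python
from collections import Counter

def keyword_if_dominant(sub_text, min_ratio=0.7):
    """Return the most frequent character only if its share exceeds min_ratio."""
    counts = Counter(sub_text)
    if not counts:
        return None
    char, count = counts.most_common(1)[0]
    # Assumption: proportion = occurrences of the character / length of the sub-text.
    ratio = count / len(sub_text)
    return char if ratio > min_ratio else None
```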
According to the scheme, the accuracy of the determined keywords is improved by further determining the proportion of the keywords with the highest occurrence frequency, and the extraction of the text feature keywords is optimized.
Fig. 3 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, and provides a specific method for determining a keyword corresponding to a text feature. As shown in fig. 3, the technical solution is as follows:
Step S301, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S302, determining text characteristics corresponding to each sub text according to the text characteristic judgment template.
Step S303, counting characters of each sub-text, and determining the text character with the highest frequency of occurrence corresponding to the text feature in each sub-text.
Step S304, determining the text character with the highest frequency of occurrence, judging whether the proportion of sub-texts with the same text feature in which that character appears is greater than a preset proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
In one embodiment, after the text character with the highest frequency of occurrence is determined, it is judged whether the proportion of sub-texts with the same text feature in which that character appears is greater than a preset proportion. Specifically, sub-texts with the same text feature are the sub-texts determined to have the same text feature. Taking the table listed in step S103 as an example, the text features corresponding to sub-text 1, sub-text 5, and sub-text 10 are all text feature 4, so sub-text 1, sub-text 5, and sub-text 10 are sub-texts with the same text feature. The characters with the highest occurrence frequency determined in sub-text 1, sub-text 5, and sub-text 10 are characters a, e, and j, respectively (characters a, e, and j may be the same or different). Taking character a as an example, it is determined whether character a appears in each of sub-text 1, sub-text 5, and sub-text 10. Sub-text 1 is the sub-text from which character a was determined, so character a must appear in it; character a may or may not appear in sub-text 5 and sub-text 10. For example, assuming that character a appears in sub-text 5 but not in sub-text 10, the proportion for character a is 2/3, or about 0.67. If the preset proportion is set to 0.5, character a can be determined as a keyword corresponding to text feature 4. Similarly, the proportions for character e and character j are confirmed in turn, and it is determined whether they are keywords corresponding to text feature 4.
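The same-feature proportion in this embodiment can be sketched as follows, reproducing the 2/3 example for character a. The function name `same_feature_ratio` and the sample sub-texts are hypothetical; only the counting rule (share of same-feature sub-texts that contain the character) comes from the description.

```python
def same_feature_ratio(char, feature, sub_texts, features):
    """Share of sub-texts labelled `feature` that contain `char`."""
    group = [s for s, f in zip(sub_texts, features) if f == feature]
    if not group:
        return 0.0
    return sum(1 for s in group if char in s) / len(group)

# Hypothetical data: three sub-texts share text feature 4, and "a" appears in two of them.
sub_texts = ["aab", "ccd", "aee", "jjk"]
features = [4, 1, 4, 4]
ratio = same_feature_ratio("a", 4, sub_texts, features)
is_keyword = ratio > 0.5  # preset proportion of 0.5, as in the example
```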
According to this scheme, the text character with the highest occurrence frequency is determined, and it is judged whether the proportion of sub-texts with the same text feature in which that character appears is greater than the preset proportion; if so, the character is determined as the keyword corresponding to the text feature. This further improves the accuracy of the extracted keywords and makes the keywords more representative of the text feature.
Fig. 4 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining keywords corresponding to text features. As shown in fig. 4, the technical solution is as follows:
Step S401, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S402, determining text characteristics corresponding to each sub-text according to the text characteristic judgment template.
Step S403, counting the characters of each sub-text, and determining the text character with the highest frequency of occurrence corresponding to the text feature in each sub-text.
Step S404, determining the text character with the highest frequency of occurrence, judging whether the proportion of all sub-texts in which that character appears is greater than a preset proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
In one embodiment, after the text character with the highest frequency of occurrence is determined, it is judged whether the proportion of all sub-texts in which that character appears is greater than a preset proportion. Specifically, assume that there are 10 sub-texts, namely sub-text 1 through sub-text 10. Assuming that the character with the highest frequency of occurrence in sub-text 5 is determined to be character e, and that the text feature determined for sub-text 5 is text feature 4, it is determined whether character e appears in sub-text 1, sub-text 2, sub-text 3, sub-text 4, sub-text 6, sub-text 7, sub-text 8, sub-text 9, and sub-text 10. If the occurrence proportion of character e is greater than the preset proportion (an exemplary value is 0.7), character e is determined to be the keyword corresponding to text feature 4.
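A rough sketch of the all-sub-texts check in step S404 follows. The source does not say whether the sub-text that produced the character is included in the denominator; this sketch assumes it is, mirroring the same-feature example in step S304. The name `overall_ratio` is hypothetical.

```python
def overall_ratio(char, sub_texts):
    """Share of all sub-texts that contain `char` (assumed to include its source sub-text)."""
    if not sub_texts:
        return 0.0
    return sum(1 for s in sub_texts if char in s) / len(sub_texts)

def is_global_keyword(char, sub_texts, min_ratio=0.7):
    """True if `char` appears in more than `min_ratio` of all sub-texts."""
    return overall_ratio(char, sub_texts) > min_ratio
```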
According to the scheme, whether the text characters with the highest occurrence frequency are larger than the preset proportion appearing in all the sub texts is judged by determining the text characters with the highest occurrence frequency, if so, the text characters with the highest occurrence frequency are determined as the keywords corresponding to the text features, the whole context of the text to be processed is reasonably considered, the text feature keywords are accurately and efficiently extracted, and subsequent further data analysis is facilitated.
Fig. 5 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining a text feature determination template. As shown in fig. 5, the technical solution is as follows:
Step S501, distributing identification information for the identification characters, and generating character identification association data.
The identification characters are the characters that users enter when commenting or recording. They may be at least one of Chinese characters, punctuation characters, numeric characters, and English characters, but are not limited to these and may include any other content that can be entered and displayed.
The identification information is data that serves as an identifier, and each identification character is assigned unique corresponding identification information. Illustratively, the identification information may be serial numbers 1 to 10000, each corresponding to one identification character; for example, the identification characters "day", "ground", and "person" correspond to identification information 1, 2, and 3, respectively.
The character identification association data can be stored in a manner of associating fields in a mapping table or a database, that is, the identification characters are associated with the allocated identification information, and after the identification characters are determined, the identification information corresponding to the identification characters can be uniquely determined.
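As one possible reading of step S501, the character identification association data could be held in an in-memory mapping table. The sketch below assigns sequential serial numbers starting at 1; the function name `build_char_ids` and the sample characters are hypothetical.

```python
def build_char_ids(chars):
    """Assign each distinct character a unique serial number, starting at 1."""
    ids = {}
    for ch in chars:
        if ch not in ids:
            ids[ch] = len(ids) + 1
    return ids

# Hypothetical sample: repeated characters keep the number they were first given.
char_ids = build_char_ids("abcab")
```

Because the mapping is one-to-one, looking up a character always yields the same identification information, which is the property the description relies on.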
Step S502, a text training set is obtained, a curve coordinate graph corresponding to each training text is generated according to the character identification association data, the curve coordinate graph is used as input, corresponding training text characteristics are used as output, and a text characteristic judgment template is obtained by training through a neural network.
The text training set is a combination of texts for learning and training, the text training set comprises a plurality of training texts, each training text comprises a plurality of identification characters, and the training texts can be comment data acquired through a network, such as comment data of a news, an entertainment event or a movie and television play.
As described above, each training text is composed of a plurality of identification characters, and each identification character and corresponding identification information are recorded in the character identification association data. In one embodiment, a curve coordinate graph corresponding to each training text is generated according to the character identification association data, wherein the curve coordinate graph is marked with the occurrence frequency of each character in the training text and the corresponding identification information.
In the training process, text features can be set in advance; for example, ten text feature grades are set, and each training text is matched with a unique text feature grade. The curve coordinate graph corresponding to each training text is used as input, and the corresponding text feature grade is used as output, to train a neural network and obtain the text feature judgment template. The training can be implemented with an existing mature neural network, such as a convolutional neural network (CNN). The trained text feature judgment template records standard characteristic curve coordinate graphs and their corresponding text features. Illustratively, the text feature judgment template records: characteristic curve 1 corresponds to text feature 1; characteristic curve 2 corresponds to text feature 2; characteristic curve 3 corresponds to text feature 3. It should be noted that this example is for illustration only; the template records a large number of characteristic curve coordinate graphs and corresponding text features, and each text feature may correspond to a plurality of different characteristic curves. For example, characteristic curve 100, characteristic curve 125, and characteristic curve 306 may all correspond to text feature 5.
Step S503, obtaining a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts.
Step S504, determining the text characteristics corresponding to each sub-text according to the text characteristic judgment template.
Step S505, counting the characters of each sub-text, and determining the text character with the highest frequency of occurrence corresponding to the text features in each sub-text.
Step S506, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
According to the scheme, identification information is distributed to the identification characters, character identification association data is generated, a text training set is obtained, the text set comprises a plurality of training texts and training text characteristics corresponding to each training text, each curve coordinate graph corresponding to each training text is generated according to the character identification association data, the curve coordinate graphs serve as input, the corresponding training text characteristics serve as output, a neural network is used for training to obtain a text characteristic judgment template, the text characteristics can be judged efficiently and accurately, and the efficiency is higher than that of a manual mode or other intelligent modes.
Fig. 6 is a flowchart of another text feature keyword determination method according to an embodiment of the present invention, which provides a specific method for determining the text feature corresponding to each sub-text. As shown in fig. 6, the technical solution is as follows:
Step S601, distributing identification information for the identification characters, and generating character identification association data.
Step S602, a text training set is obtained, a curve coordinate graph corresponding to each training text is generated according to the character identification association data, the curve coordinate graph is used as input, corresponding training text characteristics are used as output, and a text characteristic judgment template is obtained by training through a neural network.
Step S603, a text to be processed is obtained, and the text to be processed is split to obtain a plurality of sub-texts.
Step S604, determining the identification characters in each sub-text, and generating a corresponding curve coordinate graph according to the character identification association data.
In one embodiment, each sub-text is processed to obtain a corresponding curve coordinate graph. Specifically, the identification characters in each sub-text are identified, and the identification information corresponding to each identification character is determined according to the character identification association data to generate the curve coordinate graph. The abscissa of the curve coordinate graph is the identification information, and the ordinate is the occurrence frequency of each identification character. Exemplarily, fig. 7 is a first schematic diagram of a curve coordinate graph according to an embodiment of the present invention. As shown in fig. 7, identification information 1 to 6000 is recorded on the abscissa, each piece of identification information corresponding to one identification character, and the ordinate marks occurrence counts from 1 to 30.
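The curve coordinate graph of step S604 can be represented as a list of ordinate values indexed by identification information. The sketch below is illustrative only: `curve_points` and `max_id` are hypothetical names, and skipping characters that have no entry in the association data is an assumption.

```python
from collections import Counter

def curve_points(sub_text, char_ids, max_id):
    """Return occurrence counts for identification numbers 1..max_id (the curve's y-values)."""
    # Assumption: characters missing from the association data are ignored.
    counts = Counter(char_ids[ch] for ch in sub_text if ch in char_ids)
    return [counts[i] for i in range(1, max_id + 1)]
```

For the graph in fig. 7, `max_id` would be 6000, giving one ordinate value per piece of identification information.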
Step S605, comparing the curve coordinate graph corresponding to each sub-text with the characteristic curve coordinate graph in the text characteristic judgment template, and determining the text characteristic of each sub-text according to the comparison result.
In one embodiment, one or more text feature judgment templates are stored for different text features, and each text feature judgment template can be embodied as a specific characteristic curve coordinate graph. The curve coordinate graph corresponding to each sub-text is compared with these characteristic curve coordinate graphs, and the text feature of the sub-text is determined from the comparison result.
For example, fig. 8 is a second schematic diagram of a curve coordinate graph provided in an embodiment of the present invention, showing the curve coordinate graphs of the text feature judgment templates for four text features. As shown in fig. 8, the first text feature can represent happiness, the second calm, the third sadness, and the fourth anger.
Fig. 9 is a third schematic diagram of a curve coordinate graph according to an embodiment of the present invention. As shown in fig. 9, the graph is, by way of example, the curve coordinate graph corresponding to one sub-text. It is compared with the curve coordinate graphs in fig. 8 to find the one with the highest similarity, which is taken as the curve coordinate graph matching the sub-text, and the text feature corresponding to the matched graph is determined as the text feature of the sub-text. Exemplarily, once the text feature has been determined in this way, the sub-text may be characterized as, for example, expressing a pessimistic emotion.
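The selection of the most similar characteristic curve can be sketched as follows. The embodiment does not state how similarity is measured; cosine similarity over the ordinate values is used here purely as an assumption, and the template values and the names `cosine_similarity` and `match_text_feature` are invented for the illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long count vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_text_feature(curve, templates):
    """Return the text feature whose template curve is most similar to `curve`."""
    return max(templates, key=lambda name: cosine_similarity(curve, templates[name]))


# Hypothetical template curves: ordinate values for identifications 1..4.
templates = {
    "happy": [5, 0, 1, 0],
    "calm": [1, 1, 1, 1],
    "sad": [0, 4, 0, 2],
    "angry": [0, 0, 1, 6],
}
sub_text_curve = [0, 3, 0, 1]
print(match_text_feature(sub_text_curve, templates))  # sad
```

Any other curve distance, or the trained network itself, could take the place of `cosine_similarity` without changing the surrounding steps.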
Step S606, counting characters of each sub-text, and determining text characters with the highest frequency of occurrence corresponding to text features in each sub-text.
Step S607, the text character with the highest frequency of occurrence is determined as the keyword corresponding to the text feature.
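Steps S606 and S607 can be sketched as follows. The sketch assumes that the characters of all sub-texts sharing one text feature are counted together, which is one reading of the steps, and that whitespace is skipped; the function name `keywords_by_feature` and the `ignore` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def keywords_by_feature(sub_texts, features, ignore=" "):
    """Map each text feature to the most frequent character of its sub-texts.

    `features[i]` is the text feature already determined for `sub_texts[i]`.
    Characters listed in `ignore` are not counted (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    for text, feature in zip(sub_texts, features):
        counts[feature].update(ch for ch in text if ch not in ignore)
    return {
        feature: counter.most_common(1)[0][0]
        for feature, counter in counts.items()
        if counter
    }


sub_texts = ["hahaha", "aha", "boo hoo", "oh no"]
features = ["happy", "happy", "sad", "sad"]
keywords = keywords_by_feature(sub_texts, features)
print(keywords)  # {'happy': 'a', 'sad': 'o'}
```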
According to the scheme, the curve coordinate graph corresponding to each sub-text is matched against the characteristic curve coordinate graphs in the text feature judgment template, and the text feature of the sub-text is output according to the matching result. Text features can thus be judged rapidly, which better assists decisions in fields such as public opinion management and control, business decision-making, viewpoint search, information prediction and emotion management, and gives the determined keywords practical guiding significance.
Fig. 10 is a block diagram of a text feature keyword determining apparatus according to an embodiment of the present invention, where the apparatus is configured to execute the text feature keyword determining method according to the foregoing embodiment, and has corresponding functional modules and beneficial effects of the executing method. As shown in fig. 10, the apparatus specifically includes: a text splitting module 101, a text feature determination module 102, a text character statistics module 103, and a keyword determination module 104, wherein,
the text splitting module 101 is configured to obtain a text to be processed, and perform text splitting on the text to be processed to obtain a plurality of sub-texts;
the text feature determining module 102 is configured to determine a text feature corresponding to each sub-text according to a text feature judgment template;
the text character counting module 103 is configured to count characters of each sub-text, and determine a text character with the highest occurrence frequency corresponding to a text feature in each sub-text;
and the keyword determining module 104 is configured to determine the text character with the highest frequency of occurrence as the keyword corresponding to the text feature.
According to the scheme, the text to be processed is obtained and split into a plurality of sub-texts; the text feature corresponding to each sub-text is determined according to the text feature judgment template; the characters of each sub-text are counted, and the text character with the highest occurrence frequency corresponding to the text feature in each sub-text is determined and taken as the keyword corresponding to that text feature. In this way, the text features of a text and the keywords corresponding to them can be determined quickly and efficiently, text feature keywords are extracted accurately and efficiently, and subsequent data analysis is facilitated.
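The cooperation of modules 101 to 104 can be sketched as a small pipeline. The class name `KeywordDeterminationApparatus`, its method names, and the toy splitting and feature rules passed in below are assumptions for illustration; in particular, the feature rule is a stand-in for the text feature judgment template.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class KeywordDeterminationApparatus:
    split_text: Callable[[str], List[str]]   # role of text splitting module 101
    determine_feature: Callable[[str], str]  # role of text feature determination module 102

    def count_characters(self, sub_texts: List[str], features: List[str]) -> Dict[str, Counter]:
        """Role of text character statistics module 103: per-feature character counts."""
        counts: Dict[str, Counter] = defaultdict(Counter)
        for text, feature in zip(sub_texts, features):
            counts[feature].update(ch for ch in text if not ch.isspace())
        return counts

    def determine_keywords(self, counts: Dict[str, Counter]) -> Dict[str, str]:
        """Role of keyword determination module 104: most frequent character per feature."""
        return {f: c.most_common(1)[0][0] for f, c in counts.items() if c}

    def run(self, text: str) -> Dict[str, str]:
        sub_texts = self.split_text(text)
        features = [self.determine_feature(s) for s in sub_texts]
        return self.determine_keywords(self.count_characters(sub_texts, features))


# Toy stand-ins: split on full stops; call a sub-text "happy" if it contains "ha".
apparatus = KeywordDeterminationApparatus(
    split_text=lambda t: [s for s in t.split(".") if s],
    determine_feature=lambda s: "happy" if "ha" in s else "sad",
)
result = apparatus.run("hahaha.boo hoo.aha")
print(result)  # {'happy': 'a', 'sad': 'o'}
```

Passing the splitting and feature rules in as callables mirrors the statement that the modules are divided by functional logic only.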
In a possible embodiment, the text splitting module 101 is specifically configured to:
recognizing punctuation marks of the text to be processed, and splitting the text to be processed according to a recognition result to obtain a plurality of sub-texts; or
dividing the text to be processed according to a preset number of characters to obtain a plurality of sub-texts.
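The two splitting alternatives can be sketched as follows. The set of punctuation marks treated as split points and the function names are assumptions of this example; the embodiment does not list which punctuation marks are recognized.

```python
import re


def split_by_punctuation(text, marks="。！？；.!?;"):
    """Split at runs of the given punctuation marks and drop empty pieces."""
    pattern = "[" + re.escape(marks) + "]+"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def split_by_length(text, size):
    """Split into consecutive pieces of `size` characters (the last may be shorter)."""
    if size <= 0:
        raise ValueError("size must be a positive number of characters")
    return [text[i:i + size] for i in range(0, len(text), size)]


print(split_by_punctuation("I am glad. Are you? Yes!"))  # ['I am glad', 'Are you', 'Yes']
print(split_by_length("abcdefg", 3))                     # ['abc', 'def', 'g']
```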
In a possible embodiment, the determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature includes:
determining the text character with the highest frequency of occurrence, judging whether the occurrence proportion of that text character is greater than a preset occurrence proportion, and if so, determining it as the keyword corresponding to the text feature.
In a possible embodiment, the determining whether the text character with the highest occurrence frequency is greater than a preset occurrence ratio includes:
judging whether the proportion of the text character with the highest occurrence frequency among the text characters appearing in the sub-texts of the same text feature is greater than a preset proportion; or judging whether its proportion of occurrence in all the sub-texts is greater than a preset proportion.
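The proportion check can be sketched as follows. The sketch assumes that the proportion is the count of the most frequent character divided by the total number of non-whitespace characters in a reference set of sub-texts, which is either the sub-texts of the same text feature or all sub-texts; the name `keyword_above_ratio` and the threshold value 0.3 are hypothetical.

```python
from collections import Counter


def count_chars(texts):
    """Count non-whitespace characters over a list of sub-texts."""
    return Counter(ch for text in texts for ch in text if not ch.isspace())


def keyword_above_ratio(feature_texts, reference_texts, min_ratio):
    """Return the most frequent character of `feature_texts` if its share of the
    characters in `reference_texts` exceeds `min_ratio`, otherwise None."""
    counts = count_chars(feature_texts)
    total = sum(count_chars(reference_texts).values())
    if not counts or total == 0:
        return None
    char, count = counts.most_common(1)[0]
    return char if count / total > min_ratio else None


happy_texts = ["hahaha", "aha"]
all_texts = happy_texts + ["boo hoo", "oh no"]

# 'a' occurs 5 times among 9 characters of the same feature: 5/9 > 0.3.
print(keyword_above_ratio(happy_texts, happy_texts, 0.3))  # a
# Against all 19 characters: 5/19 < 0.3, so no keyword is determined.
print(keyword_above_ratio(happy_texts, all_texts, 0.3))    # None
```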
In one possible embodiment, the apparatus further comprises a training module 105 for, before obtaining the text to be processed:
allocating identification information to the identification characters to generate character identification association data, wherein the identification characters comprise at least one of Chinese characters, punctuation characters, numeric characters and English characters;
acquiring a text training set, wherein the text training set comprises a plurality of training texts and training text characteristics corresponding to each training text, and each training text is composed of one or more identification characters;
generating a curve coordinate graph corresponding to each training text according to the character identification association data;
and taking the curve coordinate graph as input, taking the corresponding training text characteristic as output, and training by utilizing a neural network to obtain a text characteristic judgment template.
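The training step can be sketched as follows. The embodiment does not specify the network architecture, so a single-layer softmax classifier trained by gradient descent stands in for the neural network here; the curves, labels, hyperparameters, and function names are all assumptions of this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(model, x):
    """Linear score of every text feature class for curve ordinates `x`."""
    weights, biases = model
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train(curves, labels, n_classes, epochs=200, lr=0.1, seed=0):
    """Fit weights and biases with per-sample gradient steps on cross-entropy loss."""
    rng = random.Random(seed)
    n_inputs = len(curves[0])
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_inputs)] for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(curves, labels):
            probs = softmax(class_scores((weights, biases), x))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j in range(n_inputs):
                    weights[k][j] -= lr * grad * x[j]
    return weights, biases


def predict(model, x):
    """Index of the highest-scoring text feature class."""
    scores = class_scores(model, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy training set: ordinate values of three identifications; 0 = happy, 1 = sad.
train_curves = [[3, 0, 0], [2, 1, 0], [0, 3, 1], [0, 2, 2]]
train_labels = [0, 0, 1, 1]
model = train(train_curves, train_labels, n_classes=2)
print([predict(model, x) for x in train_curves])
```

In this stand-in, the learned weight rows play the part of the text feature judgment template.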
In a possible embodiment, the training module 105 is specifically configured to:
determining each training identification character contained in the training text;
determining identification information corresponding to each training identification character according to the character identification association data;
and counting the identification information of each training identification character to generate a curve coordinate graph corresponding to the training text, wherein the abscissa of the curve coordinate graph is the identification information, and the ordinate is the occurrence frequency of each training identification character.
In a possible embodiment, the text feature determination module 102 is specifically configured to:
determining an identification character in each sub-text, and generating a corresponding curve coordinate graph according to the character identification association data;
comparing the curve coordinate graph corresponding to each sub-text with the characteristic curve coordinate graph in the text characteristic judgment template;
and determining the text characteristics of each sub-text according to the comparison result.
Fig. 11 is a schematic structural diagram of an apparatus according to an embodiment of the present invention. As shown in fig. 11, the apparatus includes a processor 201, a memory 202, an input device 203, and an output device 204. The number of processors 201 in the apparatus may be one or more, and one processor 201 is taken as an example in fig. 11. The processor 201, the memory 202, the input device 203 and the output device 204 in the apparatus may be connected by a bus or other means, and connection by a bus is taken as an example in fig. 11.
The memory 202 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text feature keyword determination method in the embodiment of the present invention. The processor 201 executes various functional applications and data processing of the device by running software programs, instructions and modules stored in the memory 202, that is, implements the text feature keyword determination method described above.
The memory 202 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 202 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 202 may further include memory located remotely from the processor 201, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 203 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 204 may include a display device such as a display screen.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for text feature keyword determination, the method including:
acquiring a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
determining text characteristics corresponding to each sub-text according to a text characteristic judgment template;
counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text;
and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
From the above description of the embodiments, it is obvious for those skilled in the art that the embodiments of the present invention can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better implementation in many cases. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device) perform the methods described in the embodiments of the present invention.
It should be noted that, in the embodiment of the text feature keyword determination apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.
It should be noted that the foregoing is only a preferred embodiment of the present invention and the technical principles applied. Those skilled in the art will appreciate that the embodiments of the present invention are not limited to the specific embodiments described herein, and that various obvious changes, adaptations, and substitutions are possible, without departing from the scope of the embodiments of the present invention. Therefore, although the embodiments of the present invention have been described in more detail through the above embodiments, the embodiments of the present invention are not limited to the above embodiments, and many other equivalent embodiments may be included without departing from the concept of the embodiments of the present invention, and the scope of the embodiments of the present invention is determined by the scope of the appended claims.

Claims (10)

1. The text characteristic keyword determining method is characterized by comprising the following steps:
acquiring a text to be processed, and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
determining text characteristics corresponding to each sub-text according to a text characteristic judgment template;
counting characters of each sub-text, and determining text characters with the highest occurrence frequency corresponding to text features in each sub-text;
and determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
2. The method according to claim 1, wherein the obtaining the text to be processed and performing text splitting on the text to be processed to obtain a plurality of sub-texts comprises:
recognizing punctuation marks of the text to be processed, and splitting the text to be processed according to a recognition result to obtain a plurality of sub-texts; or
dividing the text to be processed according to a preset number of characters to obtain a plurality of sub-texts.
3. The method according to claim 1 or 2, wherein the determining the text character with the highest frequency of occurrence as the keyword corresponding to the text feature comprises:
determining the text character with the highest frequency of occurrence, judging whether the text character with the highest frequency of occurrence is larger than a preset occurrence proportion, and if so, determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
4. The method of claim 3, wherein the determining whether the text character with the highest frequency of occurrence is greater than a preset occurrence ratio comprises:
judging whether the text character with the highest occurrence frequency is larger than a preset proportion of the text characters appearing in the sub-texts of the same text feature; or judging whether the text character with the highest frequency of occurrence is larger than a preset proportion of occurrence in all the sub-texts.
5. The method of claim 1, prior to obtaining the text to be processed, further comprising:
allocating identification information to the identification characters to generate character identification association data, wherein the identification characters comprise at least one of Chinese characters, punctuation characters, numeric characters and English characters;
acquiring a text training set, wherein the text training set comprises a plurality of training texts and training text characteristics corresponding to each training text, and each training text is composed of one or more identification characters;
generating a curve coordinate graph corresponding to each training text according to the character identification association data;
and taking the curve coordinate graph as input, taking the corresponding training text characteristic as output, and training by utilizing a neural network to obtain a text characteristic judgment template.
6. The method of claim 5, wherein generating a curve graph corresponding to each training text according to the character identification association data comprises:
determining each training identification character contained in the training text;
determining identification information corresponding to each training identification character according to the character identification association data;
and counting the identification information of each training identification character to generate a curve coordinate graph corresponding to the training text, wherein the abscissa of the curve coordinate graph is the identification information, and the ordinate is the occurrence frequency of each training identification character.
7. The method of claim 5, wherein determining the text feature corresponding to each of the sub-texts according to the text feature determination template comprises:
determining an identification character in each sub-text, and generating a corresponding curve coordinate graph according to the character identification association data;
comparing the curve coordinate graph corresponding to each sub-text with the characteristic curve coordinate graph in the text characteristic judgment template;
and determining the text characteristics of each sub-text according to the comparison result.
8. The text feature keyword determination device is characterized by comprising:
the text splitting module is used for acquiring a text to be processed and performing text splitting on the text to be processed to obtain a plurality of sub-texts;
the text characteristic determining module is used for determining the text characteristic corresponding to each sub-text according to the text characteristic judging template;
the text character counting module is used for counting the characters of each sub-text and determining the text character with the highest occurrence frequency corresponding to the text characteristics in each sub-text;
and the keyword determining module is used for determining the text character with the highest frequency of occurrence as the keyword corresponding to the text characteristic.
9. A text feature keyword determination apparatus, the text feature keyword determination apparatus comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text feature keyword determination method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions for performing the text feature keyword determination method of any one of claims 1-7 when executed by a computer processor.
CN201911313067.XA 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium Active CN111061838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911313067.XA CN111061838B (en) 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911313067.XA CN111061838B (en) 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium

Publications (2)

Publication Number Publication Date
CN111061838A true CN111061838A (en) 2020-04-24
CN111061838B CN111061838B (en) 2023-04-07

Family

ID=70302410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911313067.XA Active CN111061838B (en) 2019-12-18 2019-12-18 Text feature keyword determination method and device and storage medium

Country Status (1)

Country Link
CN (1) CN111061838B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170052948A1 (en) * 2007-12-18 2017-02-23 Apple Inc. System and Method for Analyzing and Categorizing Text
CN107168952A (en) * 2017-05-15 2017-09-15 北京百度网讯科技有限公司 Information generating method and device based on artificial intelligence
CN108875067A (en) * 2018-06-29 2018-11-23 北京百度网讯科技有限公司 text data classification method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孟凡博;蔡莲红;陈斌;吴鹏: "文本褒贬倾向判定系统的研究" [Research on a system for determining the positive or negative orientation of text] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010072A (en) * 2021-04-27 2021-06-22 维沃移动通信(杭州)有限公司 Searching method and device, electronic equipment and readable storage medium
CN113486184A (en) * 2021-09-07 2021-10-08 北京达佳互联信息技术有限公司 Keyword determination method, device, equipment and storage medium
CN113486184B (en) * 2021-09-07 2022-01-21 北京达佳互联信息技术有限公司 Keyword determination method, device, equipment and storage medium
CN116629254A (en) * 2023-05-05 2023-08-22 杭州正策信息科技有限公司 Policy text analysis method based on text analysis and recognition
CN116629254B (en) * 2023-05-05 2024-03-22 杭州正策信息科技有限公司 Policy text analysis method based on text analysis and recognition

Also Published As

Publication number Publication date
CN111061838B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant