CN113051369A - Text content identification method and device, readable storage medium and electronic equipment - Google Patents

Text content identification method and device, readable storage medium and electronic equipment

Info

Publication number
CN113051369A
Authority
CN
China
Prior art keywords
classification
text information
classified
determining
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110351284.9A
Other languages
Chinese (zh)
Inventor
范宁磊
陈鹏
张锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202110351284.9A priority Critical patent/CN113051369A/en
Publication of CN113051369A publication Critical patent/CN113051369A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education

Abstract

The embodiments of the invention disclose a text content identification method, a text content identification device, a readable storage medium and electronic equipment. The method comprises: obtaining text information to be classified and a keyword list corresponding to a classification subject; determining keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords in the keyword list that appear in the text information to be classified; determining, according to the occurrence probabilities of those keywords, the classification probability that the text information to be classified is correlated with the classification subject; and finally determining the relevance of the classified text information to the classification subject according to the classification probability. If the classified text information is a communication record between a worker and a user and the classification subject is a set course, the method can accurately judge whether the worker has recommended the set course suitable for the user to the user.

Description

Text content identification method and device, readable storage medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a text content identification method, a text content identification device, a readable storage medium and electronic equipment.
Background
With the development of internet applications, online teaching has changed the traditional teaching mode, and online teaching platforms are used more and more widely in daily life. An online teaching platform has a large number of users, and different users need different courses because their requirements differ. Users can hardly screen out suitable courses from the large number of courses available, so the workers of the online teaching platform need to recommend suitable courses to them. The platform in turn needs to monitor the work of its workers to ensure that they recommend courses suitable for the users. In the prior art, whether a worker has recommended a course to a user is determined from the call record between the worker and the user: specifically, it is judged whether at least one keyword corresponding to the course is mentioned in the call record; if so, the worker is judged to have recommended the course to the user, and if not, the worker is judged not to have recommended it. The prior-art method has a large error, for example when a keyword is mentioned but the course is not actually recommended.
In summary, how to accurately judge whether the worker recommends the course suitable for the user to the user is a problem to be solved at present.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for text content identification, a readable storage medium, and an electronic device, which improve the accuracy of determining whether a worker recommends a course suitable for a user to the user.
In a first aspect, an embodiment of the present invention provides a text content identification method, where the method includes: acquiring text information to be classified; acquiring a keyword list corresponding to a classification subject, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword; determining keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords in the keyword list that appear in the text information to be classified; determining, according to the occurrence probabilities corresponding to the keywords of the text information to be classified, the classification probability that the text information to be classified is correlated with the classification subject; and determining the relevance of the classified text information to the classification subject according to the classification probability.
Preferably, the classification probability is used for characterizing that the text information to be classified is positively correlated with the classification topic, and the determining the classification probability of the text information to be classified being correlated with the classification topic according to the occurrence probability specifically includes:
determining a plurality of first difference values, wherein each first difference value is the difference between 1 and the occurrence probability of the keywords of the text information to be classified;
determining a product of the plurality of first differences as the classification probability.
Preferably, the determining the relevance of the classification text information and the classification topic according to the classification probability specifically includes:
and determining that the classified text information is positively correlated with the classification subject in response to the classification probability being smaller than a preset threshold value.
Preferably, the classification probability is used to represent that the text information to be classified is negatively correlated with the classification topic, and the determining the classification probability that the text information to be classified is correlated with the classification topic according to the occurrence probability specifically includes:
and determining the continuous product of the occurrence probabilities of the keywords of the text information to be classified as the classification probability.
Preferably, the determining the relevance of the classification text information and the classification topic according to the classification probability specifically includes:
and in response to the classification probability being smaller than a preset threshold value, determining that the classification text information is negatively related to the classification subject.
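The negative-correlation case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example occurrence probabilities and the threshold value are hypothetical and do not come from the text, which specifies only that the classification probability is the continuous product of the occurrence probabilities and that it is compared with a preset threshold.

```python
import math
from typing import Iterable


def negative_case_probability(occurrence_probs: Iterable[float]) -> float:
    """Continuous product of the occurrence probabilities of the matched keywords."""
    return math.prod(occurrence_probs)


def is_negatively_correlated(occurrence_probs: Iterable[float], threshold: float) -> bool:
    """True when the classification probability is smaller than the preset threshold."""
    return negative_case_probability(occurrence_probs) < threshold


# Illustrative values only (hypothetical): four matched keywords.
example_probs = [0.6, 0.5, 0.7, 0.4]
example_probability = negative_case_probability(example_probs)  # about 0.084
```

With a hypothetical threshold of 0.1, `is_negatively_correlated(example_probs, 0.1)` would report negative correlation; how the threshold is actually chosen is left to the ROC-based procedure described later.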
Preferably, the determining process of the keyword list comprises:
acquiring a historical sample data set corresponding to a classification subject and at least one candidate keyword corresponding to the historical sample data set, wherein the historical sample data comprises historical positive sample data and historical negative sample data;
determining the occurrence probability of each candidate keyword in the historical positive sample data and the historical negative sample data according to the historical sample data;
determining keywords according to the occurrence probability;
and generating the keyword list according to the keywords and the occurrence probability.
Preferably, the determining the keyword according to the occurrence probability specifically includes:
and determining the candidate keywords with the occurrence probability in the historical positive sample data being greater than the set multiple of the occurrence probability in the historical negative sample data as the keywords.
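The keyword-list construction above can be sketched as follows. Several details are assumptions made for illustration: the occurrence probability is taken to be the fraction of samples containing the candidate keyword, matching is plain substring matching, the "set multiple" defaults to a hypothetical value of 2.0, and the probability stored in the list is the positive-sample one. The text does not fix any of these choices.

```python
from typing import Dict, List


def occurrence_probability(keyword: str, samples: List[str]) -> float:
    """Fraction of samples containing the keyword (assumed definition)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if keyword in s) / len(samples)


def build_keyword_list(
    candidates: List[str],
    positive_samples: List[str],
    negative_samples: List[str],
    multiple: float = 2.0,  # hypothetical value for the "set multiple"
) -> Dict[str, float]:
    """Keep candidates whose probability in the positive samples is greater
    than `multiple` times their probability in the negative samples."""
    keyword_list: Dict[str, float] = {}
    for kw in candidates:
        p_pos = occurrence_probability(kw, positive_samples)
        p_neg = occurrence_probability(kw, negative_samples)
        if p_pos > multiple * p_neg:
            # Storing the positive-sample probability is an assumption.
            keyword_list[kw] = p_pos
    return keyword_list


# Tiny hypothetical data set.
positives = ["english course is great", "our english course", "math"]
negatives = ["english course", "hello", "hi", "math"]
result = build_keyword_list(["english course", "math"], positives, negatives)
```

In this toy data, "english course" is kept because it appears in two of three positive samples but only one of four negative samples, while "math" is dropped because its positive-sample probability is not more than twice its negative-sample probability.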
Preferably, the threshold is predetermined according to a receiver operating characteristic ROC curve.
Preferably, the threshold is predetermined according to a receiver operating characteristic ROC curve, and specifically includes:
determining a first proportion and a second proportion of the ROC curve, wherein the first proportion is the proportion of all actually positive samples that are correctly judged as positive samples, and the second proportion is the proportion of all actually negative samples that are wrongly judged as positive samples;
determining, as the threshold, the value corresponding to the maximum difference between the first proportion and the second proportion.
Preferably, the method further comprises:
acquiring audio data to be processed;
and inputting the audio data into an automatic speech recognition model, and outputting the text information to be processed.
In a second aspect, an embodiment of the present invention provides an apparatus for text content recognition, where the apparatus includes:
the obtaining unit is used for acquiring text information to be classified;
the obtaining unit is further configured to obtain a keyword list corresponding to the classification subject, where the keyword list includes a plurality of predetermined keywords and an occurrence probability of each keyword;
the first determining unit is used for determining the keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords appearing in the text information to be classified in the keyword list;
the second determining unit is used for determining the classification probability of the text information to be classified and the classification subject according to the occurrence probability corresponding to the key words of the text information to be classified;
and the third determining unit is used for determining the relevance of the classification text information and the classification subject according to the classification probability.
Preferably, the classification probability is used to represent that the text information to be classified is positively correlated with the classification topic, and the second determining unit is specifically configured to:
determining a plurality of first difference values, wherein each first difference value is the difference between 1 and the occurrence probability of the keywords of the text information to be classified;
determining a product of the plurality of first differences as the classification probability.
Preferably, the third determining unit is specifically configured to:
and determining that the classified text information is positively correlated with the classification subject in response to the classification probability being smaller than a preset threshold value.
Preferably, the classification probability is used to represent that the text information to be classified is negatively related to the classification topic, and the second determining unit is further specifically configured to:
and determining the continuous product of the occurrence probabilities of the keywords of the text information to be classified as the classification probability.
Preferably, the third determining unit is further specifically configured to:
and in response to the classification probability being smaller than a preset threshold value, determining that the classification text information is negatively related to the classification subject.
Preferably, in the process of determining the keyword list, the obtaining unit is further configured to:
acquiring a historical sample data set corresponding to a classification subject and at least one candidate keyword corresponding to the historical sample data set, wherein the historical sample data comprises historical positive sample data and historical negative sample data;
the first determination unit is further configured to: determining the occurrence probability of each candidate keyword in the historical positive sample data and the historical negative sample data according to the historical sample data;
the first determination unit is further configured to: determining keywords according to the occurrence probability;
and the generating unit is used for generating the keyword list according to the keywords and the occurrence probability.
Preferably, the first determining unit is specifically configured to:
and determining the candidate keywords with the occurrence probability in the historical positive sample data being greater than the set multiple of the occurrence probability in the historical negative sample data as the keywords.
Preferably, the threshold is predetermined according to a receiver operating characteristic ROC curve.
Preferably, the threshold is predetermined according to a receiver operating characteristic ROC curve, and specifically includes:
the third determining unit is further configured to determine a first proportion and a second proportion of the ROC curve, wherein the first proportion is the proportion of all actually positive samples that are correctly determined as positive samples, and the second proportion is the proportion of all actually negative samples that are erroneously determined as positive samples;
the third determining unit is further configured to determine, as the threshold, the value corresponding to the maximum difference between the first proportion and the second proportion.
Preferably, the obtaining unit is further configured to:
acquiring audio data to be processed;
and the processing unit is used for inputting the audio data into an automatic speech recognition model and outputting the text information to be processed.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect or any one of the possibilities of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to the first aspect or any one of the possibilities of the first aspect.
The embodiment of the invention acquires the text information to be classified; acquires a keyword list corresponding to a classification subject, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword; determines keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords in the keyword list that appear in the text information to be classified; determines, according to the occurrence probabilities of the keywords of the text information to be classified, the classification probability that the text information to be classified is correlated with the classification subject; and determines the relevance of the classified text information to the classification subject according to the classification probability. With this method, assuming that the classified text information is a communication record between a worker and a user and the classification subject is a set course, the relevance of the classified text information to the classification subject is determined according to the classification probability. That is, whether the worker has recommended the set course suitable for the user is judged according to the classification probability, which improves the accuracy of that judgment.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of text content recognition according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of text content recognition according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of threshold determination in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of a method of text content recognition according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method of determining a keyword list in accordance with an embodiment of the present invention;
FIG. 6 is a flow chart of a method of text content recognition according to an embodiment of the present invention;
FIG. 7 is a diagram of an apparatus for text content recognition according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present disclosure is described below based on examples, but the present disclosure is not limited to only these examples. In the following detailed description of the present disclosure, certain specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present disclosure.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout this specification, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.
An online teaching platform has a large number of users and offers them a large number of courses. Different users need different courses because their requirements differ, but users can hardly screen out suitable courses from the large number available, so the workers of the online teaching platform need to recommend suitable courses to them. The platform in turn needs to monitor the work of its workers to ensure that they recommend courses suitable for the users. In the prior art, whether a worker has recommended a course to a user is determined from the call record between the worker and the user: specifically, it is judged whether at least one keyword corresponding to the course is mentioned in the call record; if so, the worker is judged to have recommended the course to the user, and if not, the worker is judged not to have recommended it. However, the prior-art method has a large error, for example when a keyword for a course is mentioned but the course is not actually recommended. In addition, the prior-art conclusion is only an absolute yes or an absolute no; because it is not supported by statistical data, the accuracy of the judgment also suffers.
In the embodiment of the invention, the relevance of the classified text information to the classification subject is judged by a text content identification method. Assuming that the classified text information is a communication record between a worker and a user and the classification subject is a set course, the relevance of the classified text information to the classification subject is determined according to the classification probability. That is, whether the worker has recommended the set course suitable for the user is judged according to the classification probability, which improves the accuracy of that judgment.
In the embodiment of the present invention, fig. 1 is a flowchart of a method for recognizing text content according to the embodiment of the present invention. As shown in fig. 1, the method specifically comprises the following steps:
Step S100, acquiring text information to be classified.
In a possible implementation manner, the text information to be classified may be call record information, for example, a call record when a worker introduces a course to a user, or other situations that need to classify the text information.
For example, a specific call record may be: "User: Hello! May I ask which courses are available for the first year of junior high school? Staff: Hello! We have courses for every first-year subject, for example first-year Chinese, first-year mathematics and first-year English. User: Which subject is taught by a senior teacher? Staff: Our English teacher is a senior English teacher with rich teaching experience, and our English course is very well designed." Assuming that first-year Chinese, first-year mathematics and first-year English are keywords, the prior art would regard this call record as a dialog recommending all three courses. In fact, first-year Chinese and first-year mathematics are only mentioned without further recommendation, and only the first-year English course is recommended with emphasis. The call record therefore needs to be judged reasonably and accurately through the subsequent processing.
Step S101, a keyword list corresponding to the classification subject is obtained, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword.
In one possible implementation, when the call record is that of a worker introducing a course to a user, the classification subjects are different courses, for example a first-year English course, a first-year mathematics course, a second-year English course, and the like. In other application scenarios the classification subject may be, for example, literature, art, or science and technology. The embodiment of the present invention does not limit the specific application scenario, which is determined according to the actual use situation.
In one possible implementation, assume that the classification subjects are different courses and that each course corresponds to its own keyword list. For example, the keyword list corresponding to the first-year English course includes the keywords: first-year English, first-year English course, English teacher, senior English teacher, first-year English teacher, and so on. Each keyword has a corresponding occurrence probability obtained from statistics on historical data; for example, the occurrence probability of first-year English is 0.6, that of first-year English course is 0.5, that of English course is 0.7, and that of senior English teacher is 0.4. These values are only illustrative; the specific data are determined according to the actual situation.
Step S102, determining keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are keywords appearing in the text information to be classified in the keyword list.
Specifically, the keywords of the text information to be classified are determined according to which keywords in the keyword list appear in the text information to be classified. For example, in the specific example of step S100, the call record is: "User: Hello! May I ask which courses are available for the first year of junior high school? Staff: Hello! We have courses for every first-year subject, for example first-year Chinese, first-year mathematics and first-year English. User: Which subject is taught by a senior teacher? Staff: Our English teacher is a senior English teacher with rich teaching experience, and our English course is very well designed." The keyword list includes: first-year English, first-year English course, English teacher, senior English teacher, first-year English teacher, and so on. Therefore, the keywords contained in the call record include: first-year English, senior English teacher and English course.
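The matching in step S102 can be sketched as a lookup of each listed keyword in the text. The matching rule is not specified in the text, so the case-insensitive substring test below is an assumption, and the function name, keyword strings and probabilities are hypothetical. A production system, particularly for Chinese transcripts, would probably tokenise the text first.

```python
from typing import Dict, List


def find_keywords(text: str, keyword_list: Dict[str, float]) -> List[str]:
    """Return the keywords from keyword_list that appear in text.

    Assumption: plain case-insensitive substring matching.
    """
    lowered = text.lower()
    return [kw for kw in keyword_list if kw.lower() in lowered]


# Hypothetical keyword list mapping each keyword to its occurrence probability.
keyword_list = {
    "english course": 0.7,
    "senior english teacher": 0.4,
    "math course": 0.5,
}
transcript = "Our English course is taught by a senior English teacher."
matched = find_keywords(transcript, keyword_list)
```

The occurrence probabilities of the matched keywords can then be read from the same dictionary, for example `[keyword_list[kw] for kw in matched]`, and passed to the classification probability calculation.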
Step S103, determining the classification probability of the text information to be classified and the classification subject according to the occurrence probability corresponding to the keywords of the text information to be classified.
In a possible implementation, two assumptions may be made in the specific processing. One assumes that the classified text information is a positive sample, that is, that the set course is recommended in the call record corresponding to the classified text information. The other assumes that the classified text information is a negative sample, that is, that the set course is not recommended in the call record corresponding to the classified text information. The above two cases are explained in detail by two specific embodiments of fig. 2 and 3.
Step S104, determining the relevance of the classification text information and the classification subject according to the classification probability.
Specifically, in both cases of step S103, the relevance between the classified text information and the classification topic is determined by comparing the classification probability with the threshold, and the specific determination manner is described in detail by two specific embodiments in fig. 2 and fig. 3.
In the embodiment of the present invention, when the classified text information is assumed to be a positive sample, a hypothesis test is performed on the probability that the classified text information does not belong to the positive sample but belongs to the negative sample. FIG. 2 is a flowchart of a text content identification method for this case according to the embodiment of the present invention. As shown in FIG. 2, the method specifically includes the following steps:
Step S200, acquiring text information to be classified.
Step S201, a keyword list corresponding to the classification subject is obtained, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword.
Step S202, determining keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are keywords appearing in the text information to be classified in the keyword list.
Step S203, determining a plurality of first difference values, where each first difference value is a difference between 1 and an occurrence probability of the keyword of the text information to be classified.
Specifically, assume that the occurrence probability is denoted by $\alpha_i$ and that there are 4 keywords: first-year English, first-year English course, English course and senior English teacher, with occurrence probabilities $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ respectively. Assume that the occurrence probability of first-year English is $\alpha_1 = 0.6$, that of first-year English course is $\alpha_2 = 0.5$, that of English course is $\alpha_3 = 0.7$, and that of senior English teacher is $\alpha_4 = 0.4$. There are then four corresponding first differences: $1 - \alpha_1 = 0.4$ for first-year English, $1 - \alpha_2 = 0.5$ for first-year English course, $1 - \alpha_3 = 0.3$ for English course, and $1 - \alpha_4 = 0.6$ for senior English teacher.
Step S204, determining the continuous product of the plurality of first difference values as the classification probability.
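Specifically, the classification probability is the continuous product of the first differences:
$$P = \prod_{i=1}^{n} (1 - \alpha_i)$$
where $n$ is the number of keywords of the text information to be classified. Assume that step S203 yields 4 first differences, $1 - \alpha_1 = 0.4$, $1 - \alpha_2 = 0.5$, $1 - \alpha_3 = 0.3$, and $1 - \alpha_4 = 0.6$; then $P = 0.4 \times 0.5 \times 0.3 \times 0.6 = 0.036$, that is, the classification probability is 0.036.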
Step S205, in response to the classification probability being smaller than a preset threshold, determining that the text information to be classified is positively correlated with the classification subject.
Specifically, positive correlation means that the text information to be classified is a positive sample. For example, if the classification subject is whether a certain set course is recommended, positive correlation means that the set course is recommended in the text information.
In a possible implementation manner, if the classification probability is greater than the preset threshold, it is determined that the text information to be classified is negatively correlated with the classification subject. Negative correlation means that the text information to be classified is a negative sample; in the same example, it means that the set course is not recommended in the text information.
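Steps S200 to S205 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names, the sample sentence, and the threshold value 0.1 are hypothetical, and plain substring matching is assumed because the text does not say how keywords are located in the text information.

```python
from math import prod


def matched_keywords(text, keyword_probs):
    """Step S202: keep the keywords from the list that appear in the text.

    Plain substring matching is an assumption of this sketch.
    """
    return {kw: p for kw, p in keyword_probs.items() if kw in text}


def classification_probability(text, keyword_probs):
    """Steps S203-S204: product of the first differences (1 - alpha_i).

    With no matched keyword the empty product is 1.0.
    """
    matched = matched_keywords(text, keyword_probs)
    return prod((1.0 - p for p in matched.values()), start=1.0)


def is_positively_correlated(text, keyword_probs, threshold):
    """Step S205: positive correlation when the probability is below the threshold."""
    return classification_probability(text, keyword_probs) < threshold


# Keyword list taken from the worked example in the text.
keyword_probs = {
    "first English": 0.6,
    "first English course": 0.5,
    "English course": 0.7,
    "senior English teacher": 0.4,
}
# Hypothetical input text and threshold, chosen only for illustration.
text = "We suggest the first English course, taught by a senior English teacher."
threshold = 0.1

probability = classification_probability(text, keyword_probs)
print(round(probability, 3))  # 0.036
print(is_positively_correlated(text, keyword_probs, threshold))  # True
```

In this sketch a text that contains none of the listed keywords gets a classification probability of 1.0 and is therefore never reported as positively correlated for any threshold below 1. The case where the probability equals the threshold is not covered by the text; the sketch treats it as not positive.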
In one possible implementation, the threshold is predetermined based on a receiver operating characteristic (ROC) curve. The ROC curve is also called a sensitivity curve, because every point on the curve reflects the same sensitivity: all points are responses to the same signal stimulus, obtained under different decision criteria. The ROC curve is a graph with the false positive rate on the horizontal axis and the true positive rate on the vertical axis, drawn from the different results that a subject produces under a specific stimulus condition when different decision criteria are applied. In other words, it is the curve traced out by a series of different binary classification cut-off values (decision thresholds), with the true positive rate (sensitivity) as the ordinate and the false positive rate (1 - specificity) as the abscissa.
In a possible implementation manner, when the threshold is predetermined according to the ROC curve of the receiver operating characteristics, the processing procedure is as shown in fig. 3, and fig. 3 is a flowchart of a method for determining the threshold, which specifically includes:
Step S300, determining a first ratio and a second ratio of the ROC curve.
Specifically, the first ratio is the proportion of all actually positive samples that are correctly determined to be positive, and the second ratio is the proportion of all actually negative samples that are erroneously determined to be positive.
In one possible implementation, the first ratio is also referred to as the true positive rate, TPR = TP/(TP + FN), where TP is the number of positive samples correctly predicted as positive and FN is the number of positive samples erroneously predicted as negative. The second ratio is also referred to as the false positive rate, FPR = FP/(FP + TN), where FP is the number of negative samples erroneously predicted as positive and TN is the number of negative samples correctly predicted as negative.
Step S301, determining a maximum value of the difference between the first ratio and the second ratio as the threshold.
Specifically, the maximum value of the difference between the first ratio and the second ratio is max(TPR - FPR), and the threshold determined in this way is denoted by ε.
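One way to read steps S300 and S301 is as a scan over candidate cut-off values that keeps the one at which TPR - FPR is largest (this criterion is commonly known as Youden's J statistic). The sketch below follows that reading, which is an interpretation rather than something the text states outright. The scores, labels, and candidate cut-offs are hypothetical, and a sample is predicted positive when its classification probability is below the cut-off, as in step S205.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, TPR, FPR) for each candidate cutoff.

    A sample is predicted positive when its score is below the cutoff,
    which mirrors step S205. Labels: 1 = actually positive, 0 = actually negative.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = []
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s < c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        # TPR = TP / (TP + FN), FPR = FP / (FP + TN)
        points.append((c, tp / positives, fp / negatives))
    return points


def best_cutoff(scores, labels, cutoffs):
    """Pick the point at which TPR - FPR is largest."""
    return max(roc_points(scores, labels, cutoffs), key=lambda p: p[1] - p[2])


# Hypothetical classification probabilities for eight labelled history samples.
scores = [0.04, 0.10, 0.30, 0.50, 0.20, 0.60, 0.70, 0.90]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoffs = [0.05, 0.15, 0.25, 0.35, 0.55, 0.65, 0.95]

best = best_cutoff(scores, labels, cutoffs)
print(best)  # (0.55, 1.0, 0.25)
```

Here the cut-off 0.55 gives TPR = 1.0 and FPR = 0.25, so the difference 0.75 is the largest among the candidates. Whether the threshold ε is meant to be this cut-off or the difference itself is not spelled out in the text; the sketch returns both.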
In this embodiment of the present invention, the text information to be classified is first assumed to be a negative sample, and this hypothesis is tested by computing the probability that the text information does not belong to the negative samples but to the positive samples. Fig. 4 is a flowchart of a text content identification method according to this embodiment. As shown in fig. 4, the method specifically includes the following steps:
Step S400, acquiring text information to be classified.
Step S401, a keyword list corresponding to the classification subject is obtained, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword.
Step S402, determining the keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords appearing in the text information to be classified in the keyword list.
Step S403, determining a product of the occurrence probabilities of the keywords of the text information to be classified as the classification probability.
Specifically, assume that the occurrence probability is denoted by $\alpha_i$ and that there are 4 keywords: first English, first English course, English course, and senior English teacher, with corresponding occurrence probabilities $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$. Assume that the occurrence probability of first English is $\alpha_1 = 0.6$, that of first English course is $\alpha_2 = 0.5$, that of English course is $\alpha_3 = 0.7$, and that of senior English teacher is $\alpha_4 = 0.4$. The continuous product of the occurrence probabilities is then $\alpha_1 \alpha_2 \alpha_3 \alpha_4 = 0.6 \times 0.5 \times 0.7 \times 0.4 = 0.084$.
Step S404, in response to the classification probability being smaller than a preset threshold, determining that the text information to be classified is negatively correlated with the classification subject.
Specifically, negative correlation means that the text information to be classified is a negative sample. For example, if the classification subject is whether a certain set course is recommended, negative correlation means that the set course is not recommended in the text information.
In a possible implementation manner, if the classification probability is greater than the preset threshold, it is determined that the text information to be classified is positively correlated with the classification subject. Positive correlation means that the text information to be classified is a positive sample; in the same example, it means that the set course is recommended in the text information.
In the embodiment of the present invention, a determination manner of the threshold is the same as the specific implementation manner in step S205, and is not described herein again.
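The variant of fig. 4 differs from that of fig. 2 only in the quantity that is multiplied and in the direction of the decision. A minimal sketch under stated assumptions follows: the function name, the sample sentence, and the threshold 0.05 are hypothetical; substring matching is assumed; and returning an undetermined result when no keyword matches is a choice of this sketch, since the text does not cover that case.

```python
from math import prod


def classify_under_negative_hypothesis(text, keyword_probs, threshold):
    """Steps S402-S404: multiply the occurrence probabilities alpha_i of the
    matched keywords and compare the product with the threshold."""
    matched = [p for kw, p in keyword_probs.items() if kw in text]
    if not matched:
        # Not specified in the text; this sketch declines to decide.
        return None, "undetermined"
    probability = prod(matched)
    # Below the threshold: negatively correlated; otherwise positively correlated.
    # Treating equality as positive is an assumption of this sketch.
    label = "negative" if probability < threshold else "positive"
    return probability, label


# Keyword list taken from the worked example in the text.
keyword_probs = {
    "first English": 0.6,
    "first English course": 0.5,
    "English course": 0.7,
    "senior English teacher": 0.4,
}
text = "We suggest the first English course, taught by a senior English teacher."

probability, label = classify_under_negative_hypothesis(text, keyword_probs, 0.05)
print(round(probability, 3), label)  # 0.084 positive
```

With the example values the product is 0.084, which exceeds the hypothetical threshold 0.05, so the text is reported as positively correlated with the classification subject.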
In the embodiment of the present invention, fig. 5 is a flowchart of a method for determining a keyword list according to the embodiment of the present invention. As shown in fig. 5, the method specifically includes the following steps:
Step S500, obtaining a history sample data set corresponding to the classification subject and at least one candidate keyword corresponding to the history sample data set.
Specifically, the historical sample data includes historical positive sample data and historical negative sample data.
For example, assume that the classification subject is whether a mathematics course is recommended to the user. After the text information of the dialogs between users and staff members is obtained, the text information of a dialog in which the staff member recommends the mathematics course to the user is labelled 1 and assigned to the historical positive sample data, and the text information of a dialog in which the staff member does not recommend the mathematics course is labelled 0 and assigned to the historical negative sample data. These labels are only examples and are not limiting in actual use.
Assume that the determined candidate keywords are word1, word2, word3, ..., wordn. The candidate keywords depend on the classification subject. For example, if the classification subject is a course, the candidate keywords may be the total price of the course, the mode of taking the course, the duration of the course, the unit price of the course, and set phrases used to introduce the course, such as first English course and senior English teacher mentioned above. These are described only as examples; the actual candidate keywords are determined according to the actual situation.
Step S501, determining the occurrence probability of each candidate keyword in the historical positive sample data and the historical negative sample data according to the historical sample data.
In one possible implementation, the occurrence probability in the historical positive sample data is denoted by $\alpha_i$ and the occurrence probability in the historical negative sample data is denoted by $\beta_i$, as shown in Table 1:
TABLE 1
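[Table 1 (image in the original publication): occurrence probability of each candidate keyword in the historical positive sample data ($\alpha_i$) and in the historical negative sample data ($\beta_i$)]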
Step S502, determining keywords according to the occurrence probability.
Specifically, the candidate keyword whose occurrence probability in the historical positive sample data is greater than the set multiple of the occurrence probability in the historical negative sample data is determined as the keyword.
For example, a candidate keyword whose occurrence probability in the historical positive sample data is greater than 3 times its occurrence probability in the historical negative sample data is determined as a keyword. For instance, if the occurrence probabilities of word1, word2, and word4 in the historical positive sample data are each greater than 3 times their occurrence probabilities in the historical negative sample data, then word1, word2, and word4 are determined as keywords.
Step S503, generating the keyword list according to the keywords and the occurrence probability.
Specifically, it is assumed that the keyword list is two columns, one column is the keywords, and the other column is the occurrence probability corresponding to the keywords.
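Steps S500 to S503 can be illustrated as follows. The sketch makes two assumptions that the text does not confirm: the occurrence probability of a keyword is estimated as the fraction of samples whose text contains it, and the probability stored in the keyword list is the one from the historical positive sample data ($\alpha_i$). The sample dialogs, candidate keywords, and function names are hypothetical; the multiple 3 comes from the example in the text.

```python
def occurrence_probability(keyword, samples):
    """Fraction of samples whose text contains the keyword (an assumed estimator)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if keyword in s) / len(samples)


def build_keyword_list(positive_samples, negative_samples, candidates, multiple=3.0):
    """Steps S501-S503: keep candidates whose probability in the positive samples
    exceeds `multiple` times their probability in the negative samples, and
    return (keyword, alpha) rows, i.e. a two-column keyword list."""
    keyword_list = []
    for kw in candidates:
        alpha = occurrence_probability(kw, positive_samples)  # positive samples
        beta = occurrence_probability(kw, negative_samples)   # negative samples
        if alpha > multiple * beta:
            keyword_list.append((kw, alpha))
    return keyword_list


# Hypothetical history samples: label 1 (positive) and label 0 (negative).
positive_samples = [
    "the first English course suits your child",
    "our senior English teacher leads the English course",
    "the price of the English course is per lesson",
    "today we discussed the price only",
]
negative_samples = [
    "how is the weather today",
    "the price of the textbook is high",
    "please call back later",
    "thank you and goodbye",
]
candidates = ["English course", "senior English teacher", "price", "today"]

keyword_list = build_keyword_list(positive_samples, negative_samples, candidates)
print(keyword_list)  # [('English course', 0.75), ('senior English teacher', 0.25)]
```

In this toy data set "price" occurs twice as often in the positive samples as in the negative ones (0.5 against 0.25), which is below the set multiple of 3, so it is not kept as a keyword.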
In the embodiment of the present invention, fig. 6 is a flowchart of a method for recognizing text contents in the embodiment of the present invention. As shown in fig. 6, the method specifically includes the following steps:
Step S600, acquiring audio data to be processed.
Step S601, inputting the audio data into an automatic voice recognition model, and outputting the text information to be processed.
Specifically, an automatic speech recognition (ASR) model converts the lexical content of human speech into computer-readable input, for example keystrokes, binary codes, or character sequences, so that the speech can be used to interact with a computer.
Step S602, acquiring text information to be classified.
Step S603, a keyword list corresponding to the classification subject is obtained, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword.
Step S604, determining the keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords appearing in the text information to be classified in the keyword list.
Step S605, determining, according to the occurrence probabilities corresponding to the keywords of the text information to be classified, the classification probability that the text information to be classified is correlated with the classification subject.
Step S606, determining the relevance of the text information to be classified to the classification subject according to the classification probability.
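The flow of fig. 6 chains speech recognition in front of the classification steps. The sketch below shows only that wiring. The speech recognizer is passed in as a callable named `transcribe`, a hypothetical stand-in, because the text does not name a particular ASR model or library. For steps S605 and S606 the sketch assumes the variant of fig. 2 (product of first differences, positive when below the threshold); the threshold 0.1 is likewise hypothetical.

```python
from math import prod
from typing import Callable, Dict


def recognize_from_audio(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    keyword_probs: Dict[str, float],
    threshold: float,
) -> bool:
    """Return True when the transcribed text is positively correlated
    with the classification subject."""
    # Steps S600-S602: audio -> text information to be classified.
    text = transcribe(audio)
    # Steps S603-S604: keywords of the list that appear in the text
    # (substring matching is an assumption of this sketch).
    matched = [p for kw, p in keyword_probs.items() if kw in text]
    # Step S605: classification probability, here the product of (1 - alpha_i).
    probability = prod((1.0 - p for p in matched), start=1.0)
    # Step S606: relevance decision against the preset threshold.
    return probability < threshold


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for a real ASR model; it ignores the audio and returns fixed text."""
    return "We suggest the first English course, taught by a senior English teacher."


keyword_probs = {
    "first English": 0.6,
    "first English course": 0.5,
    "English course": 0.7,
    "senior English teacher": 0.4,
}

result = recognize_from_audio(b"\x00\x01", fake_transcribe, keyword_probs, 0.1)
print(result)  # True
```

Passing the recognizer as a parameter keeps the classification logic testable without audio hardware or a trained model.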
Fig. 7 is a schematic diagram of an apparatus for text content recognition according to an embodiment of the present invention. As shown in fig. 7, the apparatus of the present embodiment includes an acquisition unit 701, a first determination unit 702, a second determination unit 703, and a third determination unit 704.
The acquisition unit 701 is configured to acquire text information to be classified, and is further configured to acquire a keyword list corresponding to the classification subject, where the keyword list includes a plurality of predetermined keywords and the occurrence probability of each keyword. The first determination unit 702 is configured to determine the keywords of the text information to be classified according to the keyword list, where these are the keywords in the keyword list that appear in the text information to be classified. The second determination unit 703 is configured to determine, according to the occurrence probabilities of the keywords of the text information to be classified, the classification probability that the text information to be classified is correlated with the classification subject. The third determination unit 704 is configured to determine the relevance of the text information to be classified to the classification subject according to the classification probability.
In this embodiment of the present invention, assume that the text information to be classified is a communication record between a staff member and a user, and that the classification subject is a set course. Determining the relevance of the text information to the classification subject according to the classification probability then amounts to judging whether the staff member recommended a set course suitable for the user, which improves the accuracy of that judgment.
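The division of the apparatus of fig. 7 into units can be mirrored in software by one method per unit. The class below is a hypothetical sketch of that structure only: the class and method names are invented for illustration, and the scoring rule inside the second determination step assumes the variant of fig. 2.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List, Tuple


@dataclass
class TextContentRecognizer:
    keyword_probs: Dict[str, float]  # keyword list held by the acquisition unit 701
    threshold: float                 # preset threshold (hypothetical value below)

    def acquire(self, raw_text: str) -> str:
        """Acquisition unit 701: obtain the text information to be classified."""
        return raw_text.strip()

    def first_determine(self, text: str) -> List[str]:
        """First determination unit 702: listed keywords that appear in the text."""
        return [kw for kw in self.keyword_probs if kw in text]

    def second_determine(self, keywords: List[str]) -> float:
        """Second determination unit 703: classification probability,
        assumed here to be the product of (1 - alpha_i)."""
        return prod((1.0 - self.keyword_probs[kw] for kw in keywords), start=1.0)

    def third_determine(self, probability: float) -> bool:
        """Third determination unit 704: relevance decision."""
        return probability < self.threshold

    def recognize(self, raw_text: str) -> Tuple[float, bool]:
        text = self.acquire(raw_text)
        probability = self.second_determine(self.first_determine(text))
        return probability, self.third_determine(probability)


recognizer = TextContentRecognizer(
    keyword_probs={"English course": 0.7, "senior English teacher": 0.4},
    threshold=0.2,
)
probability, positive = recognizer.recognize(
    "  Our senior English teacher leads the English course.  "
)
print(round(probability, 2), positive)  # 0.18 True
```

Keeping each unit as a separate method makes it possible to replace one stage, for example the scoring rule, without touching the others.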
Fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention. The electronic device shown in fig. 8 is a general-purpose data processing apparatus comprising a general-purpose computer hardware structure including at least a processor 81 and a memory 82. The processor 81 and the memory 82 are connected by a bus 83. The memory 82 is adapted to store instructions or programs executable by the processor 81. Processor 81 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 81 implements the processing of data and the control of other devices by executing instructions stored by the memory 82 to perform the method flows of embodiments of the present invention as described above. The bus 83 connects the above components together, and also connects the above components to a display controller 84 and a display device and an input/output (I/O) device 85. Input/output (I/O) devices 85 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, the input/output devices 85 are coupled to the system through an input/output (I/O) controller 86.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, various aspects of embodiments of the invention may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, various aspects of embodiments of the invention may take the form of: a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of embodiments of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, C++, and the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention described above describe various aspects of embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A method for text content recognition, the method comprising:
acquiring text information to be classified;
acquiring a keyword list corresponding to a classification subject, wherein the keyword list comprises a plurality of predetermined keywords and the occurrence probability of each keyword;
determining keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are keywords appearing in the text information to be classified in the keyword list;
determining, according to the occurrence probabilities corresponding to the keywords of the text information to be classified, the classification probability that the text information to be classified is correlated with the classification subject;
and determining the relevance of the classification text information and the classification subject according to the classification probability.
2. The method of claim 1, wherein the classification probability is used to characterize that the text information to be classified is positively correlated with the classification topic;
the determining, according to the occurrence probability, a classification probability that the text information to be classified is associated with the classification topic specifically includes:
determining a plurality of first difference values, wherein each first difference value is the difference between 1 and the occurrence probability of the keywords of the text information to be classified;
determining a product of the plurality of first differences as the classification probability.
3. The method of claim 2, wherein said determining the relevance of said classified textual information to said classification topic according to said classification probability comprises:
and determining that the classified text information is positively correlated with the classification subject in response to the classification probability being smaller than a preset threshold value.
4. The method of claim 1, wherein the classification probability is used to characterize that the text information to be classified is negatively correlated with the classification topic;
the determining, according to the occurrence probability, a classification probability that the text information to be classified is associated with the classification topic specifically includes:
and determining the continuous product of the occurrence probabilities of the keywords of the text information to be classified as the classification probability.
5. The method as claimed in claim 4, wherein said determining the relevance of said classified text information to said classification topic according to said classification probability comprises:
and in response to the classification probability being smaller than a preset threshold value, determining that the classification text information is negatively related to the classification subject.
6. The method of claim 1, wherein the determining of the keyword list comprises:
acquiring a historical sample data set corresponding to a classification subject and at least one candidate keyword corresponding to the historical sample data set, wherein the historical sample data comprises historical positive sample data and historical negative sample data;
determining the occurrence probability of each candidate keyword in the historical positive sample data and the historical negative sample data according to the historical sample data;
determining keywords according to the occurrence probability;
and generating the keyword list according to the keywords and the occurrence probability.
7. The method according to claim 6, wherein the determining the keyword according to the occurrence probability specifically includes:
and determining the candidate keywords with the occurrence probability in the historical positive sample data being greater than the set multiple of the occurrence probability in the historical negative sample data as the keywords.
8. A method according to claim 3 or claim 5, wherein the threshold value is predetermined in accordance with a receiver operating characteristic, ROC, curve.
9. The method of claim 8, wherein the threshold is predetermined based on a receiver operating characteristic ROC curve, comprising:
determining a first ratio and a second ratio of the ROC curve, wherein the first ratio is the proportion of all actually positive samples that are correctly determined to be positive, and the second ratio is the proportion of all actually negative samples that are erroneously determined to be positive;
determining a maximum value of the difference between the first ratio and the second ratio as the threshold.
10. The method of claim 1, further comprising:
acquiring audio data to be processed;
and inputting the audio data into an automatic voice recognition model, and outputting the text information to be processed.
11. An apparatus for text content recognition, the apparatus comprising:
the acquiring unit is used for acquiring text information to be classified;
the obtaining unit is further configured to obtain a keyword list corresponding to the classification subject, where the keyword list includes a plurality of predetermined keywords and an occurrence probability of each keyword;
the first determining unit is used for determining the keywords of the text information to be classified according to the keyword list, wherein the keywords of the text information to be classified are the keywords appearing in the text information to be classified in the keyword list;
the second determining unit is used for determining, according to the occurrence probabilities corresponding to the keywords of the text information to be classified, the classification probability that the text information to be classified is correlated with the classification subject;
and the third determining unit is used for determining the relevance of the classification text information and the classification subject according to the classification probability.
12. A computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1-10.
13. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-10.
CN202110351284.9A 2021-03-31 2021-03-31 Text content identification method and device, readable storage medium and electronic equipment Pending CN113051369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351284.9A CN113051369A (en) 2021-03-31 2021-03-31 Text content identification method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351284.9A CN113051369A (en) 2021-03-31 2021-03-31 Text content identification method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113051369A true CN113051369A (en) 2021-06-29

Family

ID=76516733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351284.9A Pending CN113051369A (en) 2021-03-31 2021-03-31 Text content identification method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113051369A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113595886A (en) * 2021-07-29 2021-11-02 北京达佳互联信息技术有限公司 Instant messaging message processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination