CN114996428A - Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium - Google Patents

Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium

Info

Publication number
CN114996428A
Authority
CN
China
Prior art keywords
text
target
time
selected time
backlog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210746223.7A
Other languages
Chinese (zh)
Inventor
李珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202210746223.7A
Publication of CN114996428A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application provides an information extraction method and device based on enterprise WeChat, electronic equipment and a storage medium. The information extraction method based on enterprise WeChat comprises the following steps: acquiring an enterprise WeChat dialog text between a first user and a second user; identifying the enterprise WeChat dialog text based on a first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text; obtaining the dialog text in which the target to-do item is located; extracting a first pre-selected time text from the dialog text in which the target to-do item is located; and identifying, based on a second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determining the first pre-selected time text as the target to-do time. The method and the device can accurately extract the target to-do items and the target to-do times in enterprise WeChat.

Description

Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to an information extraction method and device based on enterprise WeChat, electronic equipment and a storage medium.
Background
In the information era, enterprise WeChat is an important channel for communication between a bank's financial managers and their clients, and it supports service and marketing effectively. Conversations between financial managers and clients on enterprise WeChat often mention matters that the financial manager needs to handle for the client at a later time, such as checking the client's asset balance the next day or delivering a gift to the client's home at the weekend. The industry currently has many open-source tools for text classification and time information extraction, but these tools are not suited to enterprise WeChat scenarios: for example, they identify times in the conversation that are not appointment times, and they identify time periods that are not specific time points.
Disclosure of Invention
An object of the embodiment of the present application is to provide an information extraction method and apparatus based on enterprise WeChat, an electronic device, and a storage medium, which are used to accurately extract target to-do matters and target to-do time in enterprise WeChat.
In a first aspect, an embodiment of the present application provides an information extraction method based on enterprise WeChat, where the method includes:
acquiring an enterprise WeChat dialogue text between a first user and a second user;
identifying the enterprise WeChat dialog text based on a first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text;
obtaining the dialog text in which the target to-do item is located;
extracting a first pre-selected time text from the dialog text in which the target to-do item is located;
and identifying, based on a second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determining the first pre-selected time text as the target to-do time.
In the application, an enterprise WeChat dialog text between a first user and a second user is obtained, so that the dialog text can be identified based on the first preset model and at least one target to-do item in it can be obtained. The dialog text in which the target to-do item is located is then obtained, so that a first pre-selected time text can be extracted from it. Finally, based on the second preset model and the short sentence in which the first pre-selected time text is located, it can be identified whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, the first pre-selected time text is determined as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the application identifies, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. Only to-do times related to the target to-do item are therefore extracted, to-do times unrelated to the target to-do item are not extracted, and the extraction accuracy of the to-do time is improved.
In an alternative embodiment, the method further comprises:
when two or more first pre-selected time texts exist and they are in the same short sentence, splitting that short sentence into two sub-short sentences, and identifying, based on the second preset model and the two sub-short sentences respectively, whether each of the two first pre-selected time texts is the target to-do time corresponding to the target to-do item.
In this optional embodiment, when two or more first pre-selected time texts exist in the same short sentence, the short sentence is split into two sub-short sentences, and each first pre-selected time text is then checked against its own sub-short sentence with the second preset model. Whether each first pre-selected time text is the target to-do time corresponding to the target to-do item can thus be identified more accurately.
In an optional embodiment, the extracting a first pre-selected time text from the dialog text in which the target to-do item is located includes:
performing word segmentation on the dialog text in which the target to-do item is located based on LAC to determine the part of speech of each word in that dialog text;
and extracting the words whose part of speech is TIME, based on the part of speech of each word in the dialog text in which the target to-do item is located, to obtain the first pre-selected time text.
In this optional embodiment, word segmentation based on LAC determines the part of speech of each word in the dialog text in which the target to-do item is located, so that the words whose part of speech is TIME can be extracted and the first pre-selected time text obtained.
In an optional implementation manner, after extracting the first pre-selected time text from the dialog text in which the target to-do item is located, and before identifying, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, the method further includes:
judging whether the first pre-selected time text is preset time period information or festival information or not based on a first preset regular expression;
and when the first pre-selected time text is the preset time period information or the first pre-selected time text is the festival information, excluding the first pre-selected time text.
In this optional embodiment, when the first pre-selected time text is the preset time period information or the festival information, excluding it avoids extracting festival information and time period information, which do not denote a specific time point, as the target to-do time, thereby further improving the accuracy of extracting the target to-do time.
In an optional implementation manner, after extracting the first pre-selected time text from the dialog text in which the target to-do item is located, and before identifying, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, the method further includes:
determining a time dimension of the first pre-selected time text based on a second preset regular expression;
when two or more first preselected time texts exist, splicing the two first preselected time texts which belong to the same time dimension and are connected through a first preset text symbol;
and splicing two first pre-selected time texts which belong to different time dimensions and are connected through a second preset text symbol.
In this optional embodiment, when there are two or more first preselected time texts, the accuracy of extracting the target to-do time can be further improved by splicing the two first preselected time texts belonging to the same time dimension and connected by the first preset text symbol. On the other hand, the extraction accuracy of the target to-do time can be further improved by splicing the two first pre-selected time texts which belong to different time dimensions and are connected through the second preset text symbol.
In an alternative embodiment, the method further comprises:
acquiring dialog text with labeled data and dialog text without labeled data;
generating a training set and a test set for each item category based on the dialog text with labeled data and the dialog text without labeled data;
training the first preset model based on the training set and the test set of each item category.
In this optional embodiment, the dialog text with labeled data and the dialog text without labeled data are used to generate a training set and a test set for each item category, and the first preset model is then trained on the training set and the test set of each item category.
In an alternative embodiment, the method further comprises:
performing, on the training set of each item category, synonym replacement, random deletion of any two words in a sentence, random swapping of the order of any two words in a sentence, and random insertion into a sentence of any two words taken from other sentences of the same category, so as to expand the number of training sets of each item category and the number of test sets of each item category.
In this optional embodiment, by expanding the number of training sets of each item category and the number of test sets of each item category, the training data amount of the first preset model can be increased, thereby improving the training and recognition accuracy of the first preset model.
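The four augmentation operations named above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the synonym table and the sample sentences are all hypothetical, and the sentences are assumed to be already segmented into words.

```python
import random

def synonym_replace(words, synonyms):
    """Replace every word that has an entry in the (hypothetical) synonym table."""
    return [synonyms.get(w, w) for w in words]

def random_delete(words, rng, n=2):
    """Delete n randomly chosen words; very short sentences are left unchanged."""
    if len(words) <= n:
        return list(words)
    drop = set(rng.sample(range(len(words)), n))
    return [w for i, w in enumerate(words) if i not in drop]

def random_swap(words, rng):
    """Swap the positions of two randomly chosen words."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insert(words, donor, rng, n=2):
    """Insert n words taken from another sentence of the same category."""
    out = list(words)
    for w in rng.sample(donor, min(n, len(donor))):
        out.insert(rng.randint(0, len(out)), w)
    return out

def augment(words, donor, synonyms, seed=0):
    """Return one augmented variant per operation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        synonym_replace(words, synonyms),
        random_delete(words, rng),
        random_swap(words, rng),
        random_insert(words, donor, rng),
    ]

# Hypothetical, already-segmented sentences of the same item category.
sentence = ["send", "the", "calendar", "on", "Saturday"]
donor = ["mail", "the", "gift", "tomorrow"]
variants = augment(sentence, donor, {"send": "deliver"})
```

Each call yields four new sentences per original sentence, which is one simple way to enlarge a small labeled set before training.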
In a second aspect, an embodiment of the present application provides an information extraction apparatus based on enterprise WeChat, where the apparatus includes:
the first acquisition module is used for acquiring an enterprise WeChat dialogue text between a first user and a second user;
the first identification module is used for identifying the enterprise WeChat dialog text based on a first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text;
the second acquisition module is used for acquiring the dialog text in which the target to-do item is located;
the extraction module is used for extracting a first pre-selected time text from the dialog text in which the target to-do item is located;
and the second identification module is used for identifying, based on a second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determining the first pre-selected time text as the target to-do time.
By executing the enterprise WeChat-based information extraction method, the device can obtain the enterprise WeChat dialog text between a first user and a second user, identify the dialog text based on the first preset model, and obtain at least one target to-do item in it. The device can then obtain the dialog text in which the target to-do item is located and extract a first pre-selected time text from it. Finally, based on the second preset model and the short sentence in which the first pre-selected time text is located, the device can identify whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determine the first pre-selected time text as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the application identifies, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. Only to-do times related to the target to-do item are therefore extracted, to-do times unrelated to the target to-do item are not extracted, and the extraction accuracy of the to-do time is improved.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor; and
a memory configured to store machine readable instructions that, when executed by the processor, perform the enterprise WeChat-based information extraction method of any of the preceding embodiments.
By executing the enterprise WeChat-based information extraction method, the electronic equipment can obtain the enterprise WeChat dialog text between a first user and a second user, identify the dialog text based on the first preset model, and obtain at least one target to-do item in it. The electronic equipment can then obtain the dialog text in which the target to-do item is located and extract a first pre-selected time text from it. Finally, based on the second preset model and the short sentence in which the first pre-selected time text is located, the electronic equipment can identify whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determine the first pre-selected time text as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the application identifies, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. Only to-do times related to the target to-do item are therefore extracted, to-do times unrelated to the target to-do item are not extracted, and the extraction accuracy of the to-do time is improved.
In a fourth aspect, an embodiment of the present application provides a storage medium, where a computer program is stored, and the computer program, when executed by a processor, performs the enterprise WeChat-based information extraction method according to any one of the foregoing embodiments.
By executing the enterprise WeChat-based information extraction method, the storage medium makes it possible to obtain the enterprise WeChat dialog text between a first user and a second user, identify the dialog text based on the first preset model, and obtain at least one target to-do item in it. The dialog text in which the target to-do item is located can then be obtained and a first pre-selected time text extracted from it. Finally, based on the second preset model and the short sentence in which the first pre-selected time text is located, it can be identified whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, the first pre-selected time text is determined as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the application identifies, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. Only to-do times related to the target to-do item are therefore extracted, to-do times unrelated to the target to-do item are not extracted, and the extraction accuracy of the to-do time is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of an information extraction method based on enterprise WeChat disclosed in an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an information extraction apparatus based on enterprise WeChat disclosed in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Example one
Referring to Fig. 1, which is a schematic flowchart of an information extraction method based on enterprise WeChat disclosed in an embodiment of the present application, the method in the embodiment of the present application includes the following steps:
101. acquiring an enterprise WeChat dialogue text between a first user and a second user;
102. identifying the enterprise WeChat dialog text based on a first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text;
103. obtaining the dialog text in which the target to-do item is located;
104. extracting a first pre-selected time text from the dialog text in which the target to-do item is located;
105. identifying, based on a second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determining the first pre-selected time text as the target to-do time.
In the embodiment of the application, the enterprise WeChat dialog text between the first user and the second user is obtained, so that the dialog text can be identified based on the first preset model and at least one target to-do item in it can be obtained. The dialog text in which the target to-do item is located is then obtained, so that the first pre-selected time text can be extracted from it. Finally, based on the second preset model and the short sentence in which the first pre-selected time text is located, it can be identified whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, the first pre-selected time text is determined as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the application identifies, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. Only to-do times related to the target to-do item are therefore extracted, to-do times unrelated to the target to-do item are not extracted, and the extraction accuracy of the to-do time is improved.
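Steps 101 to 105 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two preset models and the time extractor are passed in as plain callables, and the keyword-based stand-ins below (`is_todo`, `find_times`, `is_todo_time`) as well as the punctuation-based `split_clauses` helper are hypothetical placeholders, not the trained models or the segmentation described in the application.

```python
import re
from typing import Callable, List, Tuple

def split_clauses(text: str) -> List[str]:
    """Split a message into short sentences at common punctuation marks."""
    return [c.strip() for c in re.split(r"[,.;?!]", text) if c.strip()]

def extract_todos(
    messages: List[str],
    is_todo: Callable[[str], bool],             # stands in for the first preset model
    find_times: Callable[[str], List[str]],     # stands in for the time-word extraction
    is_todo_time: Callable[[str, str], bool],   # stands in for the second preset model
) -> List[Tuple[str, List[str]]]:
    results = []
    for message in messages:                    # step 101: dialog text
        if not is_todo(message):                # step 102: find to-do items
            continue
        candidates = find_times(message)        # steps 103-104: pre-selected time texts
        clauses = split_clauses(message)
        kept = []
        for cand in candidates:                 # step 105: check each candidate's clause
            clause = next((c for c in clauses if cand in c), message)
            if is_todo_time(clause, cand):
                kept.append(cand)
        results.append((message, kept))
    return results

# Hypothetical keyword-based stand-ins, for illustration only.
is_todo = lambda m: "send" in m
find_times = lambda m: [t for t in ("tomorrow", "Saturday morning") if t in m]
is_todo_time = lambda clause, t: "business trip" not in clause

messages = [
    "Is it the calendar? I am on a business trip tomorrow, "
    "how about I send it to you on Saturday morning?",
    "Thanks!",
]
results = extract_todos(messages, is_todo, find_times, is_todo_time)
```

Passing the models in as callables keeps the control flow of the five steps visible and lets any classifier be substituted for the stand-ins.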
In this alternative embodiment, as an example, in some scenarios not every time in a sentence that contains the target to-do item is the target to-do time. Take the sentence "The calendar, right? I am on a business trip tomorrow; how about I send it to you on Saturday morning?": the "tomorrow" extracted by LAC from this sentence is not the target to-do time, whereas "Saturday morning" is the real target to-do time. The real target to-do time therefore needs to be identified, and it can be identified based on the second preset model and the short sentence in which the first pre-selected time text is located. Specifically, the second preset model is trained in advance to recognize that short sentences such as "The calendar, right?" and "I am on a business trip tomorrow" do not contain the target to-do time. The second preset model may be a FastText model.
In the embodiment of the present application, when the first pre-selected time text is determined as the target to-do time, the target to-do time may be converted into a standard time; for example, "next Thursday" is converted into "2021-09-20", and if no specific clock time is given, the time defaults to "10:00:00". The final recognition accuracy of the to-do time is 0.72.
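The conversion to a standard time can be sketched with the standard library. The sketch handles only three English stand-in expressions, and it assumes that "next <weekday>" means that weekday in the following calendar week; the function names, the reference date and that interpretation are assumptions made for illustration, while the 10:00:00 default is taken from the text.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional

DEFAULT_CLOCK = time(10, 0, 0)  # default from the text when no clock time is given
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_day(expr: str, today: date) -> date:
    """Resolve a small set of relative day expressions against a reference date."""
    expr = expr.lower().strip()
    if expr == "today":
        return today
    if expr == "tomorrow":
        return today + timedelta(days=1)
    if expr.startswith("next ") and expr[5:] in WEEKDAYS:
        # Assumption: "next Thursday" is the Thursday of the following calendar week.
        monday = today - timedelta(days=today.weekday())
        return monday + timedelta(days=7 + WEEKDAYS.index(expr[5:]))
    raise ValueError(f"unsupported expression: {expr!r}")

def to_standard(expr: str, today: date, clock: Optional[time] = None) -> str:
    """Format the resolved day and clock time as 'YYYY-MM-DD HH:MM:SS'."""
    day = resolve_day(expr, today)
    chosen = clock if clock is not None else DEFAULT_CLOCK
    return datetime.combine(day, chosen).strftime("%Y-%m-%d %H:%M:%S")

reference = date(2022, 6, 27)  # a Monday, chosen arbitrarily for the example
standard_a = to_standard("next Thursday", reference)
standard_b = to_standard("tomorrow", reference, time(15, 0))
```

The reference date would normally be the date on which the message was sent, so that relative expressions resolve against the conversation rather than the processing time.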
In an alternative implementation, the method of the embodiments of the present application further includes the steps of:
when two or more first pre-selected time texts exist and they are in the same short sentence, splitting that short sentence into two sub-short sentences, and identifying, based on the second preset model and the two sub-short sentences respectively, whether each of the two first pre-selected time texts is the target to-do time corresponding to the target to-do item.
In this optional embodiment, when two or more first pre-selected time texts exist in the same short sentence, the short sentence is split into two sub-short sentences, and each first pre-selected time text is then checked against its own sub-short sentence with the second preset model. Whether each first pre-selected time text is the target to-do time corresponding to the target to-do item can thus be identified more accurately.
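The text does not say where the short sentence is cut. A minimal sketch, assuming the cut is made immediately before each later time text so that every time text heads its own sub-short sentence (the first one also keeps any leading words), could look like this; the function name and the example sentence are hypothetical.

```python
from typing import List, Tuple

def split_by_times(clause: str, times: List[str]) -> List[Tuple[str, str]]:
    """Pair each time text with the sub-clause that contains only that time text."""
    # Positions of the time texts that actually occur in the clause, left to right.
    positions = sorted((clause.find(t), t) for t in times if t in clause)
    pairs = []
    for k, (start, t) in enumerate(positions):
        begin = 0 if k == 0 else start
        end = positions[k + 1][0] if k + 1 < len(positions) else len(clause)
        pairs.append((t, clause[begin:end].strip()))
    return pairs

clause = "tomorrow I am away so Saturday morning I will send it"
pairs = split_by_times(clause, ["tomorrow", "Saturday morning"])
```

Each resulting pair can then be passed to the second preset model separately, so that the judgement for one time text is not influenced by the words surrounding the other.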
In an alternative embodiment, the step of extracting a first pre-selected time text from the dialog text in which the target to-do item is located includes the following sub-steps:
performing word segmentation on the dialog text in which the target to-do item is located based on LAC to determine the part of speech of each word in that dialog text;
and extracting the words whose part of speech is TIME, based on the part of speech of each word in the dialog text in which the target to-do item is located, to obtain the first pre-selected time text.
In this optional embodiment, word segmentation based on LAC determines the part of speech of each word in the dialog text in which the target to-do item is located, so that the words whose part of speech is TIME can be extracted and the first pre-selected time text obtained.
In this alternative embodiment, LAC is an open-source word segmentation tool. LAC segments the dialog text in which the target to-do item is located into words and gives the part of speech of each word, so that the words whose part of speech is TIME, which are the words related to time, can be extracted.
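The filtering step can be sketched on top of a segmentation result. To keep the example self-contained it does not call LAC; the parallel `words` and `tags` lists below are hand-written English stand-ins for what a segmenter that emits a `TIME` tag would return, and the function name is hypothetical.

```python
from typing import List

def extract_time_texts(words: List[str], tags: List[str]) -> List[str]:
    """Keep the words whose part-of-speech tag is TIME, in their original order."""
    return [word for word, tag in zip(words, tags) if tag == "TIME"]

# Hand-written stand-in for a segmentation result (one tag per word).
# With the LAC Python package this would typically come from something like
# `words, tags = LAC(mode="lac").run(text)`; that call is not executed here.
words = ["tomorrow", "I", "travel", ",", "Saturday morning", "I", "send", "it"]
tags = ["TIME", "r", "v", "w", "TIME", "r", "v", "r"]

pre_selected = extract_time_texts(words, tags)
```

The result is the list of first pre-selected time texts that the later filtering and classification steps operate on.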
In an optional embodiment, after extracting the first pre-selected time text from the dialog text in which the target to-do item is located, and before identifying, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, the method further includes:
judging whether the first pre-selected time text is preset time period information or festival information or not based on a first preset regular expression;
and when the first pre-selected time text is the preset time period information or the first pre-selected time text is the festival information, excluding the first pre-selected time text.
In this optional embodiment, the extracted first pre-selected time text may be a specific time such as "Wednesday afternoon", time period information such as "one month" and "one year", or festival information such as "National Day" and "Dragon Boat Festival". Because the target to-do time needs to be a specific time, time period information such as "one month" and "one year" and festival information such as "National Day" and "Dragon Boat Festival" are excluded so that they are not extracted as the target to-do time, which further improves the extraction accuracy of the target to-do time.
In this optional embodiment, the first preset regular expression may be a list of patterns that match festival names such as 'Mid-Autumn', 'National Day', 'New Year', 'Dragon Boat Festival' and 'Christmas'; words such as 'now', 'just' and 'every'; and period expressions built from 'past', 'this' or 'next' followed by a number of months or weeks, optionally followed by 'around', 'before' or 'after'.
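The exclusion step can be sketched with Python's `re` module. The patterns below are simplified English stand-ins written for this illustration; they are not the first preset regular expression of the application, and the names `EXCLUDE_PATTERNS` and `keep_specific_times` are hypothetical.

```python
import re
from typing import List

# Illustrative patterns only (assumptions, not the application's expression).
EXCLUDE_PATTERNS = [
    r"mid-autumn|national day|new year|dragon boat|christmas",     # festival names
    r"\b(?:a|one|two|three|\d+)\s+(?:day|week|month|year)s?\b",    # time periods
    r"\bevery\b|\bnow\b",                                          # non-specific words
]
EXCLUDE_RE = re.compile(
    "|".join(f"(?:{p})" for p in EXCLUDE_PATTERNS), re.IGNORECASE
)

def keep_specific_times(candidates: List[str]) -> List[str]:
    """Drop candidates that look like a festival or a time period."""
    return [c for c in candidates if not EXCLUDE_RE.search(c)]

candidates = ["Wednesday afternoon", "one month", "National Day",
              "one year", "Saturday morning"]
specific = keep_specific_times(candidates)
```

Running the filter before the second preset model means the classifier only has to judge candidates that already denote a specific time point.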
In an optional implementation manner, after extracting the first pre-selected time text from the dialog text in which the target to-do item is located, and before identifying, based on the second preset model and the short sentence in which the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, the method in the embodiment of the present application further includes the following steps:
determining the time dimension of the first pre-selected time text based on a second preset regular expression;
when two or more first preselected time texts exist, splicing the two first preselected time texts which belong to the same time dimension and are connected through a first preset text symbol;
and splicing two first pre-selected time texts which belong to different time dimensions and are connected through a second preset text symbol.
In some scenarios, the dialog text in which the target to-do item is located may contain connecting words such as "or". For example, when the dialog text contains "three o'clock on Wednesday afternoon", LAC splits it into the two words "Wednesday afternoon" and "three o'clock", so two first pre-selected time texts are obtained. In fact, the two texts form one complete target to-do time only when they are spliced together, so "Wednesday afternoon" and "three o'clock" need to be spliced. In this optional embodiment, when there are two or more first pre-selected time texts, splicing two first pre-selected time texts that belong to the same time dimension and are connected by the first preset text symbol further improves the extraction accuracy of the target to-do time. Likewise, splicing two first pre-selected time texts that belong to different time dimensions and are connected by the second preset text symbol further improves the extraction accuracy of the target to-do time.
In this optional implementation, the time dimension may be one of a day dimension, a half-day dimension, and a time-of-day dimension. The second preset regular expression for identifying the day dimension may be a list of patterns matching words such as "today", "tomorrow", "week", the weekday numerals ("one" through "six" and "day"), and "weekend", together with a pattern of the form \w+ followed by a day or date suffix. The second preset regular expression for identifying the half-day dimension may be a list of patterns matching words such as "morning", "afternoon", "noon", and "evening". The second preset regular expression for identifying the time-of-day dimension may be a list of patterns of the form \w+ followed by an "o'clock" or "hour" suffix, and \d+:\d+.
In this optional embodiment, the first preset text symbol refers to a word such as "or". The second preset text symbol refers to a word such as "and".
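The dimension check and the two splicing rules described above can be sketched as follows. This is a minimal illustration only: the English patterns, the connector sets, and the names `time_dimension` and `splice` are assumptions made for the example and are not the expressions used in the application.

```python
import re

# Illustrative English patterns (assumed); one entry per time dimension.
DIMENSION_PATTERNS = {
    "day": re.compile(
        r"\b(today|tomorrow|weekend|monday|tuesday|wednesday|thursday|"
        r"friday|saturday|sunday)\b", re.I),
    "half_day": re.compile(r"\b(morning|afternoon|noon|evening)\b", re.I),
    "time": re.compile(r"\b\d{1,2}:\d{2}\b|\b\w+ o'clock\b", re.I),
}

FIRST_SYMBOLS = {"or"}    # assumed stand-in for the first preset text symbol
SECOND_SYMBOLS = {"and"}  # assumed stand-in for the second preset text symbol


def time_dimension(text):
    """Return the first dimension whose pattern matches, or None."""
    for name, pattern in DIMENSION_PATTERNS.items():
        if pattern.search(text):
            return name
    return None


def splice(candidates, connectors):
    """Merge adjacent time texts according to the two splicing rules.

    candidates: time texts in dialog order.
    connectors: the word found between each adjacent pair (len - 1 items).
    """
    if not candidates:
        return []
    merged = [candidates[0]]
    for i, connector in enumerate(connectors):
        left = time_dimension(candidates[i])
        right = time_dimension(candidates[i + 1])
        known = left is not None and right is not None
        same_dim_rule = connector in FIRST_SYMBOLS and left == right
        diff_dim_rule = connector in SECOND_SYMBOLS and left != right
        if known and (same_dim_rule or diff_dim_rule):
            merged[-1] = f"{merged[-1]} {connector} {candidates[i + 1]}"
        else:
            merged.append(candidates[i + 1])
    return merged
```

For instance, under these assumed patterns, "Wednesday" and "Thursday" joined by "or" are spliced into one candidate, while "Wednesday" and "3 o'clock" joined by "or" stay separate because they belong to different dimensions.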
In an alternative implementation, the method of the embodiments of the present application further includes the steps of:
acquiring a conversation text with label data and a conversation text without label data;
generating a training set and a testing set of each item category based on the dialog text with the label data and the dialog text without the label data;
training the first preset model based on the training set of each item category and the test set of each item category.
In this optional embodiment, by acquiring the dialog text with the label data and the dialog text without the label data, a training set and a test set for each item category can be generated based on the dialog text with the label data and the dialog text without the label data, and the first preset model can be trained based on the training set and the test set for each item category.
In this optional embodiment, the first preset model may be a 3-gram FastText model. After the 3-gram FastText model is trained on the data of the present application, its accuracy in identifying target to-do items may reach 81%, which is a good identification accuracy.
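As a rough illustration of how labeled dialog texts might be prepared for such a model, the sketch below writes one sample per line in the `__label__` prefix format that the fastText supervised trainer reads. The helper name `to_fasttext_line` and the sample data are hypothetical, and mapping "3-gram" to the `wordNgrams=3` option mentioned in the comment is an assumption, since the application does not say whether word or character n-grams are meant.

```python
def to_fasttext_line(category: str, text: str) -> str:
    """Format one sample as '__label__<category> <text>' on a single line."""
    label = "__label__" + category.strip().replace(" ", "_")
    # Collapse all whitespace (including newlines) so each sample is one line.
    return f"{label} {' '.join(text.split())}"


def write_training_file(samples, path):
    """Write (category, text) pairs to a fastText-style training file."""
    with open(path, "w", encoding="utf-8") as handle:
        for category, text in samples:
            handle.write(to_fasttext_line(category, text) + "\n")

# With the fasttext package installed, a model could then be trained with
# something like: fasttext.train_supervised(input=path, wordNgrams=3)
# (assumed reading of "3-gram"; not confirmed by the application).
```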
In this optional embodiment, the dialog text with annotation data may be obtained from historical dialog text. For example, the historical dialog text is displayed to an annotator, who annotates it, and the dialog text with annotation data is finally obtained. Further, the annotation data includes a to-do item annotation and a to-do time annotation. For example, for a historical dialog text containing "visit client" and "10 o'clock on Thursday", "visit client" is annotated as the to-do item and "10 o'clock on Thursday" is annotated as the to-do time.
In this optional embodiment, one historical dialog text may be annotated with 33 to-do item annotations and 166 to-do time annotations.
In this optional embodiment, the total number of dialog texts with and without annotation data may be 3179, that is, 3179 dialog texts may be used to train the first preset model.
In this optional embodiment, the item category refers to the category of a to-do item, which may be one or more of a campaign offer, a product recommendation, a product redemption reminder, card opening, delivery of a gift to the home, express delivery of documents, a customer visit, and a business inquiry. That is, the present application trains the first preset model per item category; for example, training the first preset model on the training set and the test set of the "campaign offer" category enables the first preset model to identify that category.
In this optional embodiment, generating the training set and the test set of each item category based on the dialog text with label data and the dialog text without label data means first dividing these dialog texts into data of a plurality of item categories, and then dividing the data of each item category into a training set and a test set. For example, the dialog texts are divided into data of 3 item categories containing 20, 30, and 50 pieces of data respectively. Then 18 of the 20 pieces are used as the training set and 2 as the test set; 27 of the 30 pieces are used as the training set and 3 as the test set; and 45 of the 50 pieces are used as the training set and 5 as the test set.
In this optional embodiment, the ratio of the training set to the test set in each item category may be 9:1. Further, random sampling may be used to draw, from the data of each item category, data satisfying the 9:1 ratio as the training set and the test set.
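The per-category 9:1 split by random sampling can be sketched as below. The function name `split_by_category`, the `(text, category)` input shape, and the fixed seed are assumptions for illustration; the application does not specify them.

```python
import random
from collections import defaultdict


def split_by_category(samples, train_ratio=0.9, seed=0):
    """Split (text, category) samples into per-category train/test lists.

    Returns {category: (train_texts, test_texts)}.
    """
    by_category = defaultdict(list)
    for text, category in samples:
        by_category[category].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {}
    for category, texts in by_category.items():
        shuffled = texts[:]
        rng.shuffle(shuffled)  # random sampling within the category
        n_train = round(len(shuffled) * train_ratio)
        splits[category] = (shuffled[:n_train], shuffled[n_train:])
    return splits
```

With categories of 20, 30, and 50 samples, this yields 18/2, 27/3, and 45/5 training and test samples respectively, matching the example in the text.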
In an alternative implementation, the method of the embodiments of the present application further includes the steps of:
performing, on the training set of each item category, the operations of synonym replacement, random deletion of any two words in a sentence, random swapping of the order of any two words in a sentence, and random insertion into a sentence of any two words from other sentences of the same category, so as to expand the training set of each item category and the test set of each item category.
In this optional embodiment, the data amount of the item categories is not evenly distributed: some item categories have a large amount of data and others have a small amount. Because the learning mechanism of the first preset model is biased toward categories with a large amount of data, categories with a small amount of data are not recognized well. By expanding the training set and the test set of each item category, the amount of training data for the first preset model can be increased, especially for item categories with little data, so that the first preset model is trained on more samples and its recognition accuracy is improved.
In this optional embodiment, an EDA (Easy Data Augmentation) method may be used to expand the training set and the test set of each item category, that is, to perform on the training set of each item category the operations of synonym replacement, random deletion of any two words in a sentence, random swapping of the order of any two words in a sentence, and random insertion into a sentence of any two words from other sentences of the same category. In the EDA method, the Synonyms tool is used to find the most similar word for each of two randomly chosen words in a sentence and replace the original words, thereby implementing the synonym replacement operation. In addition, when randomly deleting any two words in a sentence, in order to avoid deleting keywords that represent the category, the TF-IDF algorithm may be applied to each item category to extract keywords, that is, the words unique to each item category, and these words are not deleted during random deletion.
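One way to obtain the category keywords that should be protected from deletion is sketched below. It treats all tokens of one item category as a single document and scores words by a plain TF-IDF; this particular weighting, the function name `category_keywords`, and the `top_k` parameter are assumptions, as the application only names the TF-IDF algorithm.

```python
import math
from collections import Counter


def category_keywords(category_tokens, top_k=3):
    """Return {category: set of top-k TF-IDF words} for protection.

    category_tokens: {category: list of tokens from all its sentences}.
    Words that occur in every category get an IDF of 0 and are never chosen.
    """
    n_categories = len(category_tokens)
    doc_freq = Counter()
    for tokens in category_tokens.values():
        doc_freq.update(set(tokens))  # count categories containing each word

    keywords = {}
    for category, tokens in category_tokens.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_categories / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(
            (word for word in scores if scores[word] > 0),
            key=lambda word: (-scores[word], word),
        )
        keywords[category] = set(ranked[:top_k])
    return keywords
```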
In this optional embodiment, the training set of one item category may optionally be split into four sub-training sets, on which the operations of synonym replacement, random deletion of any two words in a sentence, random swapping of the order of any two words in a sentence, and random insertion into a sentence of any two words from other sentences of the same category are performed respectively.
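The four augmentation operations can be sketched on token lists as follows. This is a simplified illustration: the synonym lookup is a plain dictionary standing in for the Synonyms tool, the `protected` set stands in for the category keywords, and all function names are hypothetical.

```python
import random


def synonym_replace(tokens, synonyms, rng, n=2):
    """Replace up to n words that have an entry in the synonym table."""
    out = tokens[:]
    candidates = [i for i, tok in enumerate(out) if tok in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = synonyms[out[i]]
    return out


def random_delete(tokens, protected, rng, n=2):
    """Delete up to n words, never touching protected category keywords."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in protected]
    drop = set(rng.sample(candidates, min(n, len(candidates))))
    return [tok for i, tok in enumerate(tokens) if i not in drop]


def random_swap(tokens, rng):
    """Swap the positions of two randomly chosen words."""
    out = tokens[:]
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_insert(tokens, donor_tokens, rng, n=2):
    """Insert up to n words taken from another sentence of the same category."""
    out = tokens[:]
    for word in rng.sample(donor_tokens, min(n, len(donor_tokens))):
        out.insert(rng.randint(0, len(out)), word)
    return out
```

Each function returns a new list and leaves its input unchanged, so one original sentence can be passed through several operations to produce several augmented samples.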
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of an information extraction device based on enterprise WeChat disclosed in the embodiment of the present application, and as shown in fig. 2, the device in the embodiment of the present application includes the following functional modules:
a first obtaining module 201, configured to obtain an enterprise WeChat dialog text between a first user and a second user;
the first identification module 202 is configured to identify the enterprise WeChat dialog text based on a first preset model, and obtain at least one target to-do item in the enterprise WeChat dialog text;
the second obtaining module 203 is configured to obtain a dialog text where the target to-do item is located;
the extraction module 204 is used for extracting a first pre-selection time text based on the dialog text where the target to-do item is located;
the second identifying module 205 is configured to identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determine the first pre-selected time text as the target to-do time.
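The way the modules above hand data to one another can be sketched as a single function. The two models and the time extractor are passed in as plain callables, because their real interfaces are not given here; treating a to-do item as a substring of a message and splitting short sentences on punctuation are likewise assumptions made only for this illustration.

```python
import re


def clause_containing(message, time_text):
    """Return the short sentence (punctuation-delimited) holding time_text."""
    for clause in re.split(r"[,.;!?，。；！？]", message):
        if time_text in clause:
            return clause.strip()
    return message


def extract_to_do_times(messages, identify_items, extract_time_texts, is_to_do_time):
    """Map each target to-do item to the time texts accepted for it.

    messages: dialog texts (module 201 output).
    identify_items: stand-in for the first preset model (module 202).
    extract_time_texts: stand-in for time extraction (module 204).
    is_to_do_time: stand-in for the second preset model (module 205).
    """
    results = {}
    for item in identify_items(messages):
        for message in messages:
            if item not in message:  # module 203: find the item's dialog text
                continue
            for time_text in extract_time_texts(message):
                clause = clause_containing(message, time_text)
                if is_to_do_time(item, time_text, clause):
                    results.setdefault(item, []).append(time_text)
    return results
```

Passing the models as callables keeps the sketch independent of any particular classifier; a trained model would be wrapped in a function with the same arguments.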
By executing the enterprise WeChat-based information extraction method, the device of the embodiment of the present application can acquire the enterprise WeChat dialog text between a first user and a second user, identify the enterprise WeChat dialog text based on the first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text, acquire the dialog text where the target to-do item is located, and extract a first pre-selected time text based on that dialog text. It can then identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determine the first pre-selected time text as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the present application can identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. In this way, only the to-do time related to the target to-do item is extracted, extraction of to-do times unrelated to the target to-do item is avoided, and the extraction accuracy of the to-do time is improved.
For other details of the device in the embodiment of the present application, refer to the related description in the first embodiment of the present application, which is not repeated here.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application, and as shown in fig. 3, the electronic device in the embodiment of the present application includes:
a processor 301; and
a memory 302 configured to store machine readable instructions that, when executed by the processor 301, perform the enterprise wechat-based information extraction method of any of the preceding embodiments.
By executing the enterprise WeChat-based information extraction method, the electronic device of the embodiment of the present application can acquire the enterprise WeChat dialog text between the first user and the second user, identify the enterprise WeChat dialog text based on the first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text, acquire the dialog text where the target to-do item is located, and extract the first pre-selected time text based on that dialog text. It can then identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determine the first pre-selected time text as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the present application can identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. In this way, only the to-do time related to the target to-do item is extracted, extraction of to-do times unrelated to the target to-do item is avoided, and the extraction accuracy of the to-do time is improved.
Example four
The present application provides a storage medium storing a computer program, and the computer program is executed by a processor to execute the enterprise wechat-based information extraction method according to any one of the foregoing embodiments.
By executing the enterprise WeChat-based information extraction method, the storage medium of the embodiment of the present application can acquire the enterprise WeChat dialog text between the first user and the second user, identify the enterprise WeChat dialog text based on the first preset model to obtain at least one target to-do item in the enterprise WeChat dialog text, acquire the dialog text where the target to-do item is located, and extract the first pre-selected time text based on that dialog text. It can then identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item, and if so, determine the first pre-selected time text as the target to-do time.
Compared with the prior art, after the target to-do item is extracted, the present application can identify, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target to-do item. In this way, only the to-do time related to the target to-do item is extracted, extraction of to-do times unrelated to the target to-do item is avoided, and the extraction accuracy of the to-do time is improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is only a logical division, and other divisions may be realized in practice, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above embodiments are merely examples of the present application and are not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. An information extraction method based on enterprise WeChat is characterized by comprising the following steps:
acquiring an enterprise WeChat dialogue text between a first user and a second user;
identifying the enterprise WeChat dialogue text based on a first preset model, and obtaining at least one target backlog in the enterprise WeChat dialogue text;
obtaining a dialog text where the target backlog is located;
extracting a first pre-selection time text based on the dialog text where the target backlog is located;
and identifying, based on a second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target backlog, and if so, determining the first pre-selected time text as the target to-do time.
2. The method of claim 1, wherein the method further comprises:
when two or more first pre-selected time texts exist and are in the same short sentence, splitting the short sentence into two sub-short sentences, and identifying, based on the second preset model and the two sub-short sentences respectively, whether the two first pre-selected time texts are the target to-do time corresponding to the target backlog.
3. The method of claim 1, wherein extracting a first pre-selected time text based on the dialog text where the target to-do-item is located comprises:
performing word segmentation on the dialog text where the target backlog is located based on LAC to determine the part of speech of each word in the dialog text where the target backlog is located;
and extracting words with the part of speech being TIME based on the part of speech of each word in the dialog text of the target backlog, and obtaining the first pre-selected TIME text.
4. The method of claim 3, wherein after extracting the first pre-selected time text based on the dialog text where the target backlog is located, and before identifying, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target backlog, the method further comprises:
judging whether the first pre-selection time text is preset time period information or festival information or not based on a first preset regular expression;
and when the first pre-selected time text is the preset time period information or the first pre-selected time text is the festival information, excluding the first pre-selected time text.
5. The method of claim 3, wherein after extracting the first pre-selected time text based on the dialog text where the target backlog is located, and before identifying, based on the second preset model and the short sentence where the first pre-selected time text is located, whether the first pre-selected time text is the target to-do time corresponding to the target backlog, the method further comprises:
determining a time dimension of the first pre-selected time text based on a second preset regular expression;
when two or more first preselected time texts exist, splicing the two first preselected time texts which belong to the same time dimension and are connected through a first preset text symbol;
and splicing two first pre-selected time texts which belong to different time dimensions and are connected through a second preset text symbol.
6. The method of claim 1, wherein the method further comprises:
acquiring a dialog text with label data and a dialog text without label data;
generating a training set of each item category and a test set of each item category based on the dialogue text with the labeling data and the dialogue text without the labeling data;
training the first pre-set model based on a training set for each of the event categories and a testing set for each of the event categories.
7. The method of claim 6, wherein the method further comprises:
performing, on the training set of each item category, the operations of synonym replacement, random deletion of any two words in a sentence, random swapping of the order of any two words in a sentence, and random insertion into a sentence of any two words from other sentences of the same category, so as to expand the training set of each item category and the test set of each item category.
8. An information extraction device based on enterprise WeChat, the device comprising:
the first acquisition module is used for acquiring an enterprise WeChat dialogue text between a first user and a second user;
the first identification module is used for identifying the enterprise wechat conversation text based on a first preset model and obtaining at least one target backlog in the enterprise wechat conversation text;
the second acquisition module is used for acquiring the dialog text of the target backlog;
the extraction module is used for extracting a first pre-selection time text based on the dialog text where the target backlog is located;
and the second identification module is used for identifying whether the first pre-selection time text is the target to-do time corresponding to the target to-do item or not based on a second preset model and the short sentence where the first pre-selection time text is located, and if yes, determining the first pre-selection time text as the target to-do time.
9. An electronic device, comprising:
a processor; and
a memory configured to store machine readable instructions that, when executed by the processor, perform the enterprise wechat-based information extraction method of any of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a computer program, the computer program being executed by a processor to perform the enterprise wechat-based information extraction method according to any one of claims 1 to 7.
CN202210746223.7A 2022-06-28 2022-06-28 Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium Pending CN114996428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210746223.7A CN114996428A (en) 2022-06-28 2022-06-28 Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210746223.7A CN114996428A (en) 2022-06-28 2022-06-28 Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114996428A true CN114996428A (en) 2022-09-02

Family

ID=83037570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210746223.7A Pending CN114996428A (en) 2022-06-28 2022-06-28 Information extraction method and device based on enterprise WeChat, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114996428A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination