CN112182158A - Automatic document classification method, device, equipment and storage medium - Google Patents

Automatic document classification method, device, equipment and storage medium

Info

Publication number
CN112182158A
Authority
CN
China
Prior art keywords
text data
category
phrases
initial text
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011052067.1A
Other languages
Chinese (zh)
Inventor
高越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202011052067.1A priority Critical patent/CN112182158A/en
Publication of CN112182158A publication Critical patent/CN112182158A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application belong to the technical field of big data and relate to an automatic document classification method comprising the following steps: when initial text data is received, acquiring a stored associated data table; acquiring basic words in the associated data table, and normalizing the initial text data according to the basic words to obtain fixed phrases and remaining phrases of the initial text data; acquiring the preset categories corresponding to the initial text data, and determining the category ratio of each fixed phrase and each remaining phrase under each preset category; and determining the text category to which the initial text data belongs according to the category ratios. The application also provides an automatic document classification device, a computer device and a storage medium. The application further relates to blockchain technology: the initial text data may be stored in a blockchain. The method classifies documents automatically and improves the efficiency and accuracy of document classification.

Description

Automatic document classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method and an apparatus for automatically classifying documents, a computer device, and a storage medium.
Background
Currently, with the rapid development of information technology, more and more information is recorded by various documents. By classifying different documents, different document processing modes can be determined, and therefore information processing efficiency is improved. However, in the process of classifying different documents, the problems of low efficiency and low classification accuracy often exist.
Traditional document classification usually relies either on manual judgment or on the keywords of the document. Manual classification is prone to subjective bias, which makes the result inaccurate, and it cannot classify documents quickly and efficiently when the volume of document data is large. Keyword-based classification suffers from inaccurate keyword judgment and a low tolerance for errors in the document data. How to classify documents efficiently while keeping the classification accurate is therefore the technical problem to be solved.
Disclosure of Invention
The embodiment of the application aims to provide a method and a device for automatically classifying documents, a computer device and a storage medium, so as to solve the technical problem of low document classification accuracy.
In order to solve the above technical problem, an embodiment of the present application provides an automatic document classification method, which adopts the following technical solutions:
when initial text data is received, acquiring a stored associated data table;
acquiring basic words in the associated data table, and carrying out standardized adjustment on the initial text data according to the basic words to obtain fixed phrases and residual phrases of the initial text data;
acquiring a preset category corresponding to the initial text data, and determining the category proportion of each fixed phrase and the remaining phrases in the preset category;
and determining the text category to which the initial text data belongs according to the category ratio.
Further, the step of performing normalized adjustment on the initial text data according to the basic words to obtain fixed phrases and remaining phrases of the initial text data specifically includes:
dividing the initial text data into a plurality of phrases according to the basic words, calculating the vocabulary matching degree of the phrases and the basic words, and determining the phrases as fixed phrases when the vocabulary matching degree is greater than or equal to a first preset threshold value;
and when the vocabulary matching degree is smaller than a first preset threshold value, extracting the residual text data except the fixed phrases in the initial text data, and dividing the residual text data according to a preset semantic rule to obtain the residual phrases in the initial text data.
Further, the step of dividing the initial text data into a plurality of phrases according to the basic word specifically includes:
acquiring the length of a basic word in the associated data table;
and dividing the initial text data according to the length to obtain a plurality of phrases corresponding to the initial text data.
Further, the step of dividing the remaining text data according to a preset semantic rule to obtain remaining word groups in the initial text data specifically includes:
dividing the residual text data into a plurality of residual word groups according to a preset semantic rule, and calculating the paraphrase matching degree of the residual word groups and the basic words;
and replacing the residual word groups in the initial text data with the basic words when the paraphrase matching degree is greater than or equal to a second preset threshold value.
Further, the step of determining the category proportion of each fixed phrase and the remaining phrases in the preset category specifically includes:
acquiring a historical service record, and respectively counting the association times of the fixed phrases and the rest phrases with the services corresponding to the preset categories according to the historical service record;
and respectively calculating the ratio of the association times of the fixed phrases and the residual phrases under different preset categories to the total association times under all the preset categories, and respectively associating the calculated ratio with the category ratios of the fixed phrases and the residual phrases under the preset categories.
Further, the step of determining the text category to which the initial text data belongs according to the category proportion specifically includes:
selecting the maximum value of the corresponding category proportion in each fixed phrase or the residual phrases as the phrase proportion of the fixed phrases or the residual phrases;
and summing the phrase ratios belonging to the same preset category in the initial text data, calculating the weight value of the total category ratio of the summation result under the same preset category, and determining the preset category with the largest weight value as the text category of the initial text data.
Further, after the step of determining the text category to which the initial text data belongs according to the category proportion, the method further includes:
acquiring a processing object of the initial text data according to the text type;
processing the initial text data based on the processing object.
In order to solve the above technical problem, an embodiment of the present application further provides an automatic document classification device, which adopts the following technical solutions:
the acquisition module is used for acquiring the stored associated data table when the initial text data is received;
the adjusting module is used for acquiring basic words in the associated data table and carrying out normalized adjustment on the initial text data according to the basic words to obtain fixed phrases and residual phrases of the initial text data;
the first confirming module is used for acquiring a preset category corresponding to the initial text data and determining the category proportion of each fixed phrase and the remaining phrases under the preset category;
and the second confirmation module is used for determining the text category to which the initial text data belongs according to the category proportion.
In order to solve the technical problem, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores computer-readable instructions, and the processor implements the steps of the above document automatic classification method when executing the computer-readable instructions.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when executed by a processor, the computer-readable instructions implement the steps of the above document automatic classification method.
In the method, when initial text data is received, a stored associated data table is acquired. The associated data table stores a plurality of basic words, according to which the initial text data can be normalized. The basic words in the associated data table are acquired, and the initial text data is normalized according to them to obtain the fixed phrases and remaining phrases of the initial text data. Normalizing the initial text data makes it conform better to the target specification, reduces the workload, and makes classification of the initial text data faster and more accurate. Once the fixed phrases and remaining phrases are obtained, the preset categories corresponding to the initial text data are acquired, and the category ratio of each fixed phrase and each remaining phrase under each preset category is determined. The text category to which the initial text data belongs is then determined according to the category ratios. This achieves automatic and rapid classification of text data, reduces the workload of classifying large amounts of text data, raises classification efficiency and tolerance for errors in the text data, makes classification more accurate, and thereby further improves the efficiency of processing the text data.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for automatically classifying documents according to the present application;
FIG. 3 is a schematic diagram of the structure of one embodiment of an automatic document sorting apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: the automatic document classification device 400 includes: an obtaining module 401, an adjusting module 402, a first confirming module 403, and a second confirming module 404.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (MPEG-4 Part 14) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the document automatic classification method provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the document automatic classification apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a method for automatically classifying documents in accordance with the present application is shown. The automatic document classification method comprises the following steps:
step S201, when initial text data is received, a stored associated data table is obtained;
In this embodiment, the initial text data is the text data as first acquired; if the received data is voice data, the initial text data is the text converted from that voice data. When the initial text data is received, an associated data table is acquired. The table stores a plurality of basic words, according to which the initial text data can be normalized.
It is emphasized that the initial text data may also be stored in a node of a block chain in order to further ensure privacy and security of the initial text data.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S202, obtaining basic words in the associated data table, and carrying out standardized adjustment on the initial text data according to the basic words to obtain fixed phrases and residual phrases of the initial text data;
In this embodiment, when the initial text data is normalized according to the associated data table, the basic words in the table are acquired. A basic word is a preset standard vocabulary item, and vocabulary in the initial text data may be replaced according to the basic words; normalization here means adjusting the initial text data into text data that meets the target specification. Specifically, the basic words are matched against the initial text data, and any phrase in the initial text data that completely matches a basic word is taken as a fixed phrase. Once the fixed phrases are obtained, the text in the initial text data other than the fixed phrases is the remaining text data, and dividing the remaining text data according to a preset semantic rule yields the remaining phrases of the initial text data.
Step S203, acquiring a preset category corresponding to the initial text data, and determining the category ratio of each fixed phrase and the remaining phrases in the preset category;
In this embodiment, the preset categories of the initial text data are acquired. The preset categories are all the candidate classes into which the initial text data may be classified; in an insurance scenario, for example, they may follow the categories of service, such as an autonomous region selection service, a list separation service, and the like. The preset categories are associated with the scenario of the current initial text data. The fixed phrases and remaining phrases of the initial text data may have different ratios under different preset categories, and these ratios may be determined from a preset comparison table. For example, the preset comparison table contains a category A and a category B; a fixed phrase has a preset ratio of 80% in category A and 20% in category B, and a remaining phrase has a preset ratio of 60% in category A and 40% in category B. Each phrase in the initial text data thus has one category ratio for each preset category that exists.
And step S204, determining the text category to which the initial text data belongs according to the category proportion.
In this embodiment, once the category ratios of each fixed phrase and each remaining phrase under the different preset categories are obtained, the category ratios under the same preset category are summed to give the total ratio of the initial text data under that preset category. The total ratios of all preset categories are then compared, and the preset category with the largest total ratio is taken as the text category to which the initial text data belongs. For example, suppose the initial text data contains three phrases A, B and C whose category ratios are 20%, 70% and 80% respectively under preset category No. 1 and 80%, 30% and 20% respectively under preset category No. 2. The sum of the category ratios is then 170% under preset category No. 1 and 130% under preset category No. 2, so the initial text data is determined to belong to preset category No. 1.
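The summation in step S204 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `classify_by_total_ratio`, the dictionary layout and the labels `category_1` and `category_2` are assumptions made here; the percentages are the ones from the A, B, C example above.

```python
from collections import defaultdict


def classify_by_total_ratio(phrase_ratios):
    """Sum the category ratios of all phrases per category and pick the largest.

    phrase_ratios maps phrase -> {category: ratio in percent}.
    The data layout is an assumption for illustration only.
    """
    totals = defaultdict(int)
    for ratios in phrase_ratios.values():
        for category, ratio in ratios.items():
            totals[category] += ratio
    # The category with the largest total ratio is the text category.
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Percentages taken from the A, B, C example in the description.
phrase_ratios = {
    "A": {"category_1": 20, "category_2": 80},
    "B": {"category_1": 70, "category_2": 30},
    "C": {"category_1": 80, "category_2": 20},
}
best, totals = classify_by_total_ratio(phrase_ratios)
print(best, totals)
```

With these inputs the totals are 170 for category 1 and 130 for category 2, matching the description.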
The method and the device have the advantages that automatic and rapid classification of the text data is realized, workload when a large amount of text data are classified is reduced, classification efficiency of the text data is improved, fault tolerance of the text data classification is improved, the classification of the text data is more accurate, and processing efficiency of the text data is further improved.
In some embodiments of the present application, the performing the normalized adjustment on the initial text data according to the basic word to obtain a fixed phrase and a remaining phrase of the initial text data includes:
dividing the initial text data into a plurality of phrases according to the basic words, calculating the vocabulary matching degree of the phrases and the basic words, and determining the phrases as fixed phrases when the vocabulary matching degree is greater than or equal to a first preset threshold value;
and when the vocabulary matching degree is smaller than a first preset threshold value, extracting the residual text data except the fixed phrases in the initial text data, and dividing the residual text data according to a preset semantic rule to obtain the residual phrases in the initial text data.
In this embodiment, when the initial text data is normalized according to the associated data table, the basic words in the table are acquired and the initial text data is divided into a plurality of phrases according to them; the division may be based on the length of the basic words or on their attributes. Once the phrases are obtained, the vocabulary matching degree between each phrase and the basic words is calculated. The vocabulary matching degree is the glyph (written-form) similarity between the phrase and a basic word. When the vocabulary matching degree is greater than or equal to a first preset threshold, the phrase is determined to be a fixed phrase and is retained.
Matching the initial text data by glyph similarity is a coarse screening step that reduces the workload of text processing. However, phrases obtained by dividing the initial text data according to the basic words may not be semantically coherent. Therefore, when the vocabulary matching degree is smaller than the first preset threshold, that is, when a phrase is not similar to any basic word, the remaining text data other than the fixed phrases is extracted from the initial text data and divided according to a preset semantic rule. The preset semantic rule is a rule set in advance, for example dividing a sentence according to grammatical relations such as subject, object and attributive. Each phrase produced by this semantic division is a remaining phrase of the initial text data; the non-fixed phrases produced by the earlier division according to the basic words are not used directly as remaining phrases.
In the embodiment, the phrases in the initial text data are primarily screened through the first preset threshold, so that the workload of text data processing is reduced, the fault tolerance rate during text data processing is improved, and the classification efficiency and the accuracy of the text data are further improved.
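The threshold screening described above might look like the sketch below. The source does not say how glyph similarity is computed, so `difflib.SequenceMatcher` from the Python standard library is swapped in here as a stand-in character-level similarity; the function names, the threshold value 0.8 and the sample words are likewise assumptions.

```python
from difflib import SequenceMatcher


def vocabulary_match(phrase, basic_word):
    """Character-level similarity in [0, 1].

    Stand-in for the glyph similarity of the description, which the
    source does not define precisely.
    """
    return SequenceMatcher(None, phrase, basic_word).ratio()


def split_fixed_phrases(phrases, basic_words, first_threshold=0.8):
    """Separate phrases into fixed phrases and unmatched phrases.

    A phrase is fixed when its best match against any basic word reaches
    the first preset threshold (0.8 is an assumed value).
    """
    fixed, unmatched = [], []
    for phrase in phrases:
        score = max(vocabulary_match(phrase, word) for word in basic_words)
        if score >= first_threshold:
            fixed.append(phrase)
        else:
            unmatched.append(phrase)
    return fixed, unmatched


basic_words = ["client", "insurance cost"]  # hypothetical basic words
fixed, unmatched = split_fixed_phrases(["client", "cliant", "premium"], basic_words)
print(fixed, unmatched)
```

In the described method the text behind the unmatched phrases would not be kept as is; it is extracted as remaining text data and divided again by the preset semantic rule.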
In some embodiments of the present application, the dividing the initial text data into a plurality of phrases according to the basic word includes:
acquiring the length of a basic word in the associated data table;
and dividing the initial text data according to the length to obtain a plurality of phrases corresponding to the initial text data.
In this embodiment, when the initial text data is divided according to the basic words in the associated data table, the length of a basic word in the table may be acquired and the initial text data divided according to that length. The length is the word length of the basic word, for example its length in bytes. Once the length is obtained, the total length of the initial text data is acquired and the initial text data is divided in a preset division order. For example, if the preset order is to divide from the end of the current sentence of the initial text data, then division and matching proceed from the end of that sentence in steps of the basic word's length. The vocabulary matching degree between the basic word and each phrase of the initial text data can then be calculated.
The embodiment realizes the division of the initial text data, and further enables the initial text data to be accurately classified according to the phrases obtained by the division.
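A minimal sketch of dividing from the end of a sentence by a fixed length is given below. It counts characters rather than bytes, and the helper name `divide_from_end` is hypothetical; both choices are assumptions for illustration.

```python
def divide_from_end(text, length):
    """Cut text into chunks of `length` characters, starting at the end.

    Any shorter leftover chunk ends up at the front of the sentence.
    Counting characters instead of bytes is an assumption.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    chunks = []
    end = len(text)
    while end > 0:
        start = max(0, end - length)
        chunks.append(text[start:end])
        end = start
    chunks.reverse()  # restore reading order
    return chunks


chunks = divide_from_end("abcdefg", 3)
print(chunks)
```

Each resulting chunk can then be compared with the basic word of that length to obtain its vocabulary matching degree.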
In some embodiments of the present application, the dividing the remaining text data according to the preset semantic rule to obtain the remaining word groups in the initial text data includes:
dividing the residual text data into a plurality of residual word groups according to a preset semantic rule, and calculating the paraphrase matching degree of the residual word groups and the basic words;
and replacing the residual word groups in the initial text data with the basic words when the paraphrase matching degree is greater than or equal to a second preset threshold value.
In this embodiment, when the vocabulary matching degree is smaller than the first preset threshold, the remaining text data other than the fixed phrases is extracted from the initial text data; the remaining text data is text that has not yet undergone any division. The extracted remaining text data is divided into a plurality of remaining phrases according to a preset semantic rule, for example according to grammatical relations such as the subject, object and attributive of a sentence. Once the remaining phrases are obtained, the paraphrase matching degree between each remaining phrase and the basic words is calculated. If the paraphrase matching degree is greater than or equal to a second preset threshold, the remaining phrase and the basic word have the same meaning, and the remaining phrase in the initial text data is replaced with the basic word.
Take as an example initial text data meaning roughly "a one-time premium of 1000 yuan per client". "Client" and "1000 yuan" match basic words and are determined to be fixed phrases. "One-time premium" is the remaining text data, which can be split into "one time" and "premium" according to the preset semantic rule. If the paraphrase matching degree between "premium" and the basic word "insurance cost" reaches the second preset threshold, "premium" is replaced with "insurance cost", and "insurance cost" becomes a remaining phrase of the initial text data.
The embodiment realizes the normalization processing of the remaining phrases, so that non-professional vocabularies possibly existing in the initial text data can be replaced by the vocabularies with the comparative specifications set in the associated data table, and the subsequent document classification is more accurate.
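The replacement step can be illustrated as follows. The source does not specify how the paraphrase matching degree is computed, so this sketch takes it from a hypothetical precomputed score table; the function name, the scores and the threshold 0.9 are all assumptions.

```python
def normalize_remaining(remaining_phrases, paraphrase_scores, second_threshold=0.9):
    """Replace a remaining phrase with its best-matching basic word.

    paraphrase_scores maps phrase -> {basic_word: score in [0, 1]}.
    The table and the threshold value are assumptions for illustration.
    """
    normalized = []
    for phrase in remaining_phrases:
        candidates = paraphrase_scores.get(phrase, {})
        if candidates:
            basic_word, score = max(candidates.items(), key=lambda item: item[1])
            if score >= second_threshold:
                normalized.append(basic_word)
                continue
        # No basic word is close enough in meaning: keep the phrase as is.
        normalized.append(phrase)
    return normalized


# Hypothetical scores for the "premium" example in the description.
scores = {"premium": {"insurance cost": 0.95, "client": 0.05}}
result = normalize_remaining(["one time", "premium"], scores)
print(result)
```

Here "premium" becomes "insurance cost" while "one time" is left unchanged because it has no sufficiently close basic word.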
In some embodiments of the present application, the determining a category proportion of each of the fixed phrases and the remaining phrases in the preset category includes:
acquiring a historical service record, and respectively counting the association times of the fixed phrases and the rest phrases with the services corresponding to the preset categories according to the historical service record;
and respectively calculating the ratio of the association times of the fixed phrases and the residual phrases under different preset categories to the total association times under all the preset categories, and respectively associating the calculated ratio with the category ratios of the fixed phrases and the residual phrases under the preset categories.
In this embodiment, the category proportion of the fixed phrases and the remaining phrases under the preset categories may also be determined from a historical service record. Specifically, the historical service record logs each association between a fixed phrase or remaining phrase and the service corresponding to a preset category; for example, fixed phrase A was associated with the list separation service once, and fixed phrase B was associated with the autonomous region selection service three times. From the historical service record, the number of associations of each fixed phrase and each remaining phrase with the service of each preset category is counted, and the ratio of that count to the total number of associations is calculated, where the total is the sum of the phrase's association counts over all preset categories. For example, if fixed phrase A has been associated with the list separation service 2 times and with the autonomous region selection service 3 times, the total number of associations is 5; the ratio of fixed phrase A under the list separation category is 2/5, and its ratio under the autonomous region selection category is 3/5.
When the ratios of the fixed phrases and the remaining phrases in different preset categories are obtained through calculation, the ratios are respectively associated with the category ratios of the fixed phrases and the remaining phrases in the preset categories, namely the ratio corresponding to the fixed phrases is associated with the category ratio of the fixed phrases in the corresponding preset categories, and the ratio corresponding to the remaining phrases is associated with the category ratio of the remaining phrases in the corresponding preset categories.
In this embodiment, the category proportions of the remaining phrases and the fixed phrases are determined from the number of times they have been associated with the services of the preset categories, so that the initial text data can be classified accurately according to those proportions and the resulting category better matches the expected service category.
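The counting and ratio steps described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the record format (a list of phrase/category pairs) and the function name `category_proportions` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def category_proportions(history):
    """Turn (phrase, category) association records into per-phrase proportions.

    `history` is a hypothetical record format: one (phrase, category) pair
    per logged association between a phrase and a service category.
    """
    counts = defaultdict(Counter)
    for phrase, category in history:
        counts[phrase][category] += 1

    proportions = {}
    for phrase, per_category in counts.items():
        total = sum(per_category.values())  # associations over all categories
        proportions[phrase] = {c: n / total for c, n in per_category.items()}
    return proportions


# Worked example from the text: phrase A is associated 2 times with the
# list separation service and 3 times with the autonomous region selection service.
history = [("A", "list separation")] * 2 + [("A", "autonomous region selection")] * 3
props = category_proportions(history)
# props["A"] holds 2/5 for list separation and 3/5 for autonomous region selection
```

A phrase that never appears in the history gets no entry here; how such phrases should be treated is not stated in the text.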
In some embodiments of the application, the determining the text category to which the initial text data belongs according to the category proportion includes:
for each fixed phrase or remaining phrase, selecting the maximum of its category proportions as the phrase proportion of that fixed phrase or remaining phrase;
and summing the phrase proportions belonging to the same preset category in the initial text data, calculating, as a weight value, the ratio of the summation result to the total category proportion under that preset category, and determining the preset category with the largest weight value as the text category of the initial text data.
In this embodiment, when the category of the initial text data is determined from the category proportions, the phrase proportions belonging to the same preset category in the initial text data may also be summed. For a phrase that has category proportions under several preset categories, the largest of those category proportions is taken as its phrase proportion; fixed phrases and remaining phrases are collectively referred to as phrases in this embodiment. For example, suppose the initial text data contains three phrases A, B and C. The category proportion of A is 60% under preset category No. 1 and 40% under preset category No. 2, so the phrase proportion of A is 60%, under preset category No. 1. The category proportion of B is 30% under preset category No. 1 and 70% under preset category No. 2, so the phrase proportion of B is 70%, under preset category No. 2. The category proportion of C is 20% under preset category No. 1 and 80% under preset category No. 2, so the phrase proportion of C is 80%, under preset category No. 2.
The phrase proportions belonging to the same preset category are then summed. In the example above, the phrase proportions of B and C both fall under preset category No. 2 and the phrase proportion of A falls under preset category No. 1, so the sum is 60% for preset category No. 1 and 150% for preset category No. 2. The weight value of each preset category is the ratio of this sum to the total category proportion under the same preset category, where the total category proportion is the sum of the category proportions of all phrases under that preset category: 110% for preset category No. 1 and 190% for preset category No. 2. The weight value under preset category No. 1 is therefore 60% / 110%, i.e., 54.5%, and the weight value under preset category No. 2 is 150% / 190%, i.e., 78.9%. The weight value reflects the weight of the initial text data under the corresponding preset category, and the preset category with the largest weight value is determined as the text category of the current initial text data; in the above example, the text category is preset category No. 2.
This embodiment determines the text category of the initial text data and improves the accuracy with which the initial text data is classified.
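The weight-value calculation can be sketched as below, using the A, B, C example from this embodiment. The function name `classify_by_weight` and the dictionary layout are assumptions for illustration, and the text does not say how ties between weight values are resolved; this sketch simply keeps the first maximum.

```python
def classify_by_weight(category_props):
    """Pick the category with the largest weight value.

    `category_props` maps each phrase to {category: category proportion}.
    weight(category) = sum of phrase proportions assigned to the category
                       / sum of all category proportions under the category
    """
    phrase_sum = {}  # summed phrase proportions per category
    total = {}       # total category proportion per category
    for props in category_props.values():
        best = max(props, key=props.get)  # category of the phrase proportion
        phrase_sum[best] = phrase_sum.get(best, 0.0) + props[best]
        for category, proportion in props.items():
            total[category] = total.get(category, 0.0) + proportion

    weights = {c: phrase_sum.get(c, 0.0) / t for c, t in total.items() if t > 0}
    return max(weights, key=weights.get), weights


example = {
    "A": {1: 0.60, 2: 0.40},
    "B": {1: 0.30, 2: 0.70},
    "C": {1: 0.20, 2: 0.80},
}
best, weights = classify_by_weight(example)
# best is category 2; the weights are roughly 0.545 (No. 1) and 0.789 (No. 2)
```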
In some embodiments of the present application, after determining the text category to which the initial text data belongs according to the category proportion, the method further includes:
acquiring a processing object of the initial text data according to the text category;
processing the initial text data based on the processing object.
In this embodiment, once the text category of the initial text data has been determined, a processing object corresponding to the initial text data may also be obtained according to the text category. The processing object is the party that handles the service corresponding to the initial text data, and different text categories correspond to different processing objects. The initial text data may then be processed by the determined processing object; for example, service tracking processing may be performed on the initial text data.
In this embodiment, the initial text data is processed as soon as its text category is determined, which further improves the processing efficiency of the text data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions directing the associated hardware. The instructions can be stored in a computer readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or it may be a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an automatic document classification apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is specifically applicable to various electronic devices.
As shown in fig. 3, the automatic document classification device 400 according to the present embodiment includes: an obtaining module 401, an adjusting module 402, a first confirming module 403 and a second confirming module 404. Wherein:
an obtaining module 401, configured to obtain a stored associated data table when initial text data is received;
In this embodiment, the initial text data is text data acquired for the first time; if the received data is voice data, the initial text data is the text converted from that voice data. When the initial text data is received, the associated data table is acquired. The associated data table stores a plurality of basic words, and the initial text data can be normalized according to the basic words in the associated data table.
It is emphasized that, in order to further ensure the privacy and security of the initial text data, the initial text data may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
An adjusting module 402, configured to obtain the basic words in the associated data table, and perform normalized adjustment on the initial text data according to the basic words to obtain the fixed phrases and remaining phrases of the initial text data;
wherein the adjusting module 402 comprises:
the first dividing unit is used for dividing the initial text data into a plurality of phrases according to the basic words, calculating the vocabulary matching degree of the phrases and the basic words, and determining the phrases as fixed phrases when the vocabulary matching degree is greater than or equal to a first preset threshold value;
and the second dividing unit is used for extracting the remaining text data other than the fixed phrases in the initial text data when the vocabulary matching degree is smaller than the first preset threshold value, and dividing the remaining text data according to a preset semantic rule to obtain the remaining phrases in the initial text data.
Wherein the first dividing unit includes:
the obtaining subunit is used for obtaining the length of the basic word in the associated data table;
and the dividing subunit is used for dividing the initial text data according to the length to obtain a plurality of phrases corresponding to the initial text data.
Wherein the second dividing unit includes:
the calculation subunit is used for dividing the remaining text data into a plurality of remaining phrases according to a preset semantic rule and calculating the paraphrase matching degree of the remaining phrases and the basic words;
and the replacing subunit is used for replacing the remaining phrases in the initial text data with the basic words when the paraphrase matching degree is greater than or equal to a second preset threshold value.
In this embodiment, when the initial text data is normalized according to the associated data table, the basic words in the associated data table are obtained. A basic word is a preset basic vocabulary item, and vocabulary in the initial text data may be replaced according to the basic words; normalized adjustment means adjusting the initial text data into text data that meets the target specification. Specifically, the basic words are matched against the initial text data, and each phrase in the initial text data that completely matches a basic word is taken as a fixed phrase. Once the fixed phrases have been obtained, the text in the initial text data other than the fixed phrases is the remaining text data, and dividing the remaining text data according to a preset semantic rule yields the remaining phrases corresponding to the initial text data.
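The division into fixed phrases and remaining phrases can be sketched as follows. The text does not define the vocabulary matching degree or the preset semantic rule, so this sketch substitutes assumptions for both: `difflib.SequenceMatcher.ratio()` stands in for the matching degree, and splitting on whitespace and basic punctuation stands in for the semantic rule. The function name `split_phrases` and the sample sentence are hypothetical.

```python
import re
from difflib import SequenceMatcher


def split_phrases(text, base_words, threshold=1.0):
    """Split text into fixed phrases (matching a basic word) and remaining phrases."""
    fixed, leftover, buf = [], [], []
    # Candidate window sizes come from the lengths of the basic words.
    lengths = sorted({len(w) for w in base_words}, reverse=True)
    i = 0
    while i < len(text):
        match = None
        for n in lengths:
            chunk = text[i:i + n]
            if len(chunk) < n:
                continue
            # Assumed matching degree: character-level similarity ratio.
            if any(len(w) == n and SequenceMatcher(None, chunk, w).ratio() >= threshold
                   for w in base_words):
                match = chunk
                break
        if match:
            if buf:
                leftover.append("".join(buf))
                buf = []
            fixed.append(match)
            i += len(match)
        else:
            buf.append(text[i])
            i += 1
    if buf:
        leftover.append("".join(buf))
    # Assumed "semantic rule": split leftover text on whitespace and punctuation.
    remaining = [p for seg in leftover for p in re.split(r"[\s,.;]+", seg) if p]
    return fixed, remaining


fixed, remaining = split_phrases(
    "please cancel policy and refund premium",
    ["cancel policy", "refund premium"],
)
# fixed     -> ["cancel policy", "refund premium"]
# remaining -> ["please", "and"]
```

With `threshold=1.0` only exact matches become fixed phrases, which corresponds to the complete match described in this paragraph; a lower threshold would admit near matches.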
A first confirmation module 403, configured to obtain the preset categories corresponding to the initial text data, and determine the category proportion of each fixed phrase and each remaining phrase under the preset categories;
wherein the first confirmation module 403 includes:
the statistical unit is used for acquiring a historical service record and counting, according to the historical service record, the number of times each fixed phrase and each remaining phrase has been associated with the service corresponding to each preset category;
and the association unit is used for calculating, for each fixed phrase and each remaining phrase, the ratio of its number of associations under each preset category to its total number of associations under all the preset categories, and recording the calculated ratio as the category proportion of that fixed phrase or remaining phrase under the preset category.
In this embodiment, the preset categories of the initial text data are obtained, where the preset categories are all of the classification categories available when the initial text data is classified. For example, in an insurance scenario the preset categories may be divided by type of service into an autonomous region selection service, a list separation service, and the like; the preset categories are associated with the scenario to which the current initial text data belongs. The fixed phrases and remaining phrases of the initial text data may have different proportions under different preset categories, and these proportions may be determined from a preset comparison table. For example, the preset comparison table includes a category A and a category B; the preset proportion of a fixed phrase is 80% under category A and 20% under category B, and the preset proportion of a remaining phrase is 60% under category A and 40% under category B. A category proportion is calculated for each phrase in the initial text data under each of the preset categories.
A second confirmation module 404, configured to determine the text category to which the initial text data belongs according to the category proportions.
Wherein the second confirmation module 404 comprises:
the first confirming unit is used for selecting, for each fixed phrase or remaining phrase, the maximum of its category proportions as the phrase proportion of that fixed phrase or remaining phrase;
and the second confirming unit is used for summing the phrase proportions belonging to the same preset category in the initial text data, calculating, as a weight value, the ratio of the summation result to the total category proportion under that preset category, and determining the preset category with the largest weight value as the text category of the initial text data.
In this embodiment, once the category proportions of each fixed phrase and remaining phrase in the initial text data under the different preset categories have been obtained, the category proportions under the same preset category are summed to obtain the total proportion of the initial text data under that preset category. The total proportions of all the preset categories are then compared, and the preset category with the largest total proportion is taken as the text category to which the initial text data belongs. For example, suppose the phrases A, B and C in the initial text data have category proportions of 20%, 70% and 80% under preset category No. 1 and of 80%, 30% and 20% under preset category No. 2. The sum of the category proportions of the initial text data is then 170% under preset category No. 1 and 130% under preset category No. 2, so the initial text data is determined to belong to preset category No. 1.
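The simpler total-proportion rule in this paragraph can be sketched as follows; the function name `classify_by_total` and the dictionary layout are assumptions for illustration. Unlike the weight-value variant, every category proportion of every phrase contributes to the sum.

```python
def classify_by_total(category_props):
    """Pick the category whose summed category proportions are largest.

    `category_props` maps each phrase to {category: category proportion}.
    """
    totals = {}
    for props in category_props.values():
        for category, proportion in props.items():
            totals[category] = totals.get(category, 0.0) + proportion
    return max(totals, key=totals.get), totals


example = {
    "A": {1: 0.20, 2: 0.80},
    "B": {1: 0.70, 2: 0.30},
    "C": {1: 0.80, 2: 0.20},
}
best, totals = classify_by_total(example)
# best is category 1; the totals are about 1.7 (No. 1) and 1.3 (No. 2)
```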
The automatic document classification device in this embodiment further includes:
the first processing module is used for acquiring a processing object of the initial text data according to the text category;
and the second processing module is used for processing the initial text data based on the processing object.
In this embodiment, once the text category of the initial text data has been determined, a processing object corresponding to the initial text data may also be obtained according to the text category. The processing object is the party that handles the service corresponding to the initial text data, and different text categories correspond to different processing objects. The initial text data may then be processed by the determined processing object; for example, service tracking processing may be performed on the initial text data.
The automatic document classification device provided by this application classifies text data automatically and quickly, reduces the workload of classifying large amounts of text data, improves both the classification efficiency and the fault tolerance of text data classification so that the classification is more accurate, and further improves the processing efficiency of the text data.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various application software, such as computer readable instructions of a document automatic classification method. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer readable instructions stored in the memory 61 or process data, such as computer readable instructions for executing the automatic document classification method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer device provided by this application classifies text data automatically and quickly, reduces the workload of classifying large amounts of text data, improves both the classification efficiency and the fault tolerance of text data classification so that the classification is more accurate, and further improves the processing efficiency of the text data.
The present application provides yet another embodiment, which is a computer-readable storage medium having computer-readable instructions stored thereon which are executable by at least one processor to cause the at least one processor to perform the steps of the method for automatically classifying documents as described above.
The computer-readable storage medium provided by the application realizes automatic and rapid classification of the text data, reduces the workload when classifying a large amount of text data, improves the classification efficiency of the text data, and improves the fault tolerance of the text data classification, so that the classification of the text data is more accurate, and the processing efficiency of the text data is further improved.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A method for automatically classifying documents is characterized by comprising the following steps:
when initial text data is received, acquiring a stored associated data table;
acquiring basic words in the associated data table, and carrying out normalized adjustment on the initial text data according to the basic words to obtain fixed phrases and remaining phrases of the initial text data;
acquiring a preset category corresponding to the initial text data, and determining the category proportion of each fixed phrase and the remaining phrases in the preset category;
and determining the text category to which the initial text data belongs according to the category proportion.
2. The method according to claim 1, wherein the step of performing a normalized adjustment on the initial text data according to the basic words to obtain fixed phrases and remaining phrases of the initial text data specifically includes:
dividing the initial text data into a plurality of phrases according to the basic words, calculating the vocabulary matching degree of the phrases and the basic words, and determining the phrases as fixed phrases when the vocabulary matching degree is greater than or equal to a first preset threshold value;
and when the vocabulary matching degree is smaller than a first preset threshold value, extracting the residual text data except the fixed phrases in the initial text data, and dividing the residual text data according to a preset semantic rule to obtain the residual phrases in the initial text data.
3. The method according to claim 2, wherein the step of dividing the initial text data into a plurality of phrases according to the basic word specifically comprises:
acquiring the length of a basic word in the associated data table;
and dividing the initial text data according to the length to obtain a plurality of phrases corresponding to the initial text data.
4. The method according to claim 2, wherein the step of dividing the remaining text data according to a preset semantic rule to obtain remaining phrases in the initial text data specifically comprises:
dividing the remaining text data into a plurality of remaining phrases according to a preset semantic rule, and calculating the paraphrase matching degree of the remaining phrases and the basic words;
and replacing the remaining phrases in the initial text data with the basic words when the paraphrase matching degree is greater than or equal to a second preset threshold value.
5. The method according to claim 1, wherein the step of determining the category proportion of each of the fixed phrases and the remaining phrases in the preset category specifically comprises:
acquiring a historical service record, and counting, according to the historical service record, the number of times each fixed phrase and each remaining phrase has been associated with the service corresponding to each preset category;
and calculating, for each fixed phrase and each remaining phrase, the ratio of its number of associations under each preset category to its total number of associations under all the preset categories, and recording the calculated ratio as the category proportion of that fixed phrase or remaining phrase under the preset category.
6. The method according to claim 1, wherein the step of determining the text category to which the initial text data belongs according to the category proportion specifically comprises:
for each fixed phrase or remaining phrase, selecting the maximum of its category proportions as the phrase proportion of that fixed phrase or remaining phrase;
and summing the phrase proportions belonging to the same preset category in the initial text data, calculating, as a weight value, the ratio of the summation result to the total category proportion under that preset category, and determining the preset category with the largest weight value as the text category of the initial text data.
7. The method of automatically classifying documents according to claim 1, further comprising, after the step of determining the text category to which the initial text data belongs according to the category proportion, the steps of:
acquiring a processing object of the initial text data according to the text category;
processing the initial text data based on the processing object.
8. An apparatus for automatically classifying documents, comprising:
the acquisition module is used for acquiring the stored associated data table when the initial text data is received;
the adjusting module is used for acquiring basic words in the associated data table and carrying out normalized adjustment on the initial text data according to the basic words to obtain fixed phrases and remaining phrases of the initial text data;
the first confirming module is used for acquiring a preset category corresponding to the initial text data and determining the category proportion of each fixed phrase and the remaining phrases under the preset category;
and the second confirmation module is used for determining the text category to which the initial text data belongs according to the category proportion.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, implements the steps of the method of automatically classifying documents according to any one of claims 1 to 7.
10. A computer-readable storage medium, having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of the method of automatically classifying documents according to any one of claims 1 to 7.
CN202011052067.1A 2020-09-29 2020-09-29 Automatic document classification method, device, equipment and storage medium Pending CN112182158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052067.1A CN112182158A (en) 2020-09-29 2020-09-29 Automatic document classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011052067.1A CN112182158A (en) 2020-09-29 2020-09-29 Automatic document classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112182158A true CN112182158A (en) 2021-01-05

Family

ID=73947249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052067.1A Pending CN112182158A (en) 2020-09-29 2020-09-29 Automatic document classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182158A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376151A (en) * 2018-01-31 2018-08-07 深圳市阿西莫夫科技有限公司 Question classification method, device, computer equipment and storage medium
CN110597988A (en) * 2019-08-28 2019-12-20 腾讯科技(深圳)有限公司 Text classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination