CN112732867B - File processing method and device - Google Patents

File processing method and device

Info

Publication number
CN112732867B
Authority
CN
China
Prior art keywords
resource
file
cluster
newly added
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011602808.9A
Other languages
Chinese (zh)
Other versions
CN112732867A (en)
Inventor
陈静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN202011602808.9A
Publication of CN112732867A
Application granted
Publication of CN112732867B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a file processing method and device. The method comprises: acquiring a plurality of resource files and constructing feature information for each resource file; clustering the plurality of resource files based on the feature information of each resource file to generate a plurality of resource clusters; and, according to a received resource file extraction request, extracting resource files from at least one resource cluster to form a file package, and returning the file package. The invention solves the technical problem that a teacher cannot accurately find suitable teaching resources because prior-art methods recommend only a single type of teaching resource.

Description

File processing method and device
Technical Field
The invention relates to the field of data processing, in particular to a file processing method and device.
Background
With the popularization of online education, electronic resources have become increasingly abundant. This massive growth in resources gives teachers more choices and opens up more possibilities for teaching: teachers can apply a variety of resources in class to enrich classroom content and enliven the classroom atmosphere. However, faced with electronic resources that are large in number and varied in type, a teacher often finds it difficult to select the intended resources quickly and accurately. To improve resource matching efficiency, the prior art usually recommends a single type of resource, for example question recommendation. In a complete teaching process, however, a teacher needs a combination of several types of resources to cover everything required when teaching a specific piece of content. For example, the teacher needs to prepare courseware and in-class or after-class exercises, and uses demonstration animations or knowledge-point teaching videos to consolidate students' knowledge or make the lesson more engaging.
For the problem that a teacher cannot accurately find suitable teaching resources because the prior art recommends only a single type of teaching resource, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a file processing method and device, which at least solve the technical problem that a teacher cannot accurately find a proper teaching resource due to a single teaching resource recommending method in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a method for processing a file, including: acquiring a plurality of resource files and constructing characteristic information of each resource file; clustering a plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters; and extracting the resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
Further, acquiring a plurality of resource files and constructing feature information of each resource file includes: acquiring text information in a resource file and performing word segmentation on the text information; cleaning the word segmentation result using a stop-word list; and performing text vectorization based on the cleaning result to obtain a text vector representing the feature information.
Further, in the case that the resource file is a video file, obtaining text information in the resource file includes: acquiring caption data to obtain text information in the video file under the condition that the video file comprises the caption data; in the case where the video file does not include subtitle data, voice information in the video file is extracted and converted into text information.
Further, the method further includes: creating a stop-word list corresponding to the file type of the resource file, wherein creating the stop-word list corresponding to the file type of the resource file includes: performing word segmentation on all resource files in a resource library, wherein the resource library includes a plurality of types of resource files; screening out, from the word segmentation results of all the resource files, the stop words corresponding to each type of resource file, wherein the stop words corresponding to each type of resource file are determined according to the frequency with which each word occurs in that type of resource file; and generating the stop-word list corresponding to each file type from the stop words corresponding to that type of resource file.
Further, cleaning the word segmentation result using a stop-word list includes: cleaning the word segmentation result using the stop-word list corresponding to the file type of the resource file.
Further, after performing text vectorization processing based on the preprocessing result to obtain a text vector for representing the feature information, the method further includes one or more of the following: performing scaling processing on the text vector through an activation function; and performing dimension reduction processing on the text vector.
Further, clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters, including: clustering a plurality of resource files based on the characteristic information of each resource file through a K-means clustering algorithm to generate a plurality of resource clusters.
Further, after clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters, the method further includes: receiving newly added resource files and constructing characteristic information of the newly added resource files; determining neighbor files of the newly added resource files according to the characteristic information of the newly added resource files and the characteristic information of the existing resource files; and dividing the newly added resource into any one of a plurality of resource clusters according to the distance relation between the newly added resource file and the neighbor file, or regenerating a resource cluster for the newly added resource.
Further, in the case that the neighbor files all belong to the same first target resource cluster, dividing the newly added resource file into any one of the plurality of resource clusters, or generating a new resource cluster for the newly added resource file, according to the distance relation between the newly added resource file and the neighbor files includes: acquiring a first distance between the newly added resource file and the centroid of the first target resource cluster; acquiring a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid; acquiring the average distance between all resource files in the first target resource cluster and the centroid; dividing the newly added resource file into the first target resource cluster in the case that the difference between the first distance and the second distance is smaller than or equal to the average distance; and generating a new resource cluster for the newly added resource file in the case that the difference between the first distance and the second distance is larger than the average distance.
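The same-cluster decision described above can be illustrated with a minimal sketch. It assumes Euclidean distance and plain Python lists as text vectors; the function names (`euclidean`, `centroid`, `joins_cluster`) and the sample data are hypothetical and are not taken from the patent.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def joins_cluster(new_vec, cluster_vecs):
    """True if the new file joins the cluster, False if it starts a new one."""
    c = centroid(cluster_vecs)
    dists = [euclidean(v, c) for v in cluster_vecs]
    first = euclidean(new_vec, c)        # new file to centroid
    second = max(dists)                  # farthest existing file to centroid
    average = sum(dists) / len(dists)    # mean distance to centroid
    return (first - second) <= average

# Hypothetical cluster of four files whose centroid is (1, 1).
cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
near = joins_cluster([1.5, 1.0], cluster)
far = joins_cluster([10.0, 10.0], cluster)
```

In this example `near` is True because the new file lies inside the cluster, while `far` is False because its distance to the centroid exceeds the cluster radius by more than the average distance.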
Further, in the case that the neighbor files do not all belong to the same resource cluster, dividing the newly added resource file into any one of the plurality of resource clusters, or generating a new resource cluster for the newly added resource file, according to the distance relation between the newly added resource file and the neighbor files includes: acquiring the proportion of neighbor files that belong to each resource cluster; in the case that there exist a second target resource cluster and a third target resource cluster whose proportions differ by less than a first preset value, acquiring a first average value of the distances between the newly added resource file and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource file and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files in the second target resource cluster, and a fourth average value of the distances between the neighbor files in the third target resource cluster; if the four average values meet a preset condition, adding the newly added resource file and the neighbor files in the second target resource cluster to the third target resource cluster, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset condition includes: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value; if the four average values do not meet the preset condition, acquiring the resource cluster to which the centroid with the shortest distance to the newly added resource file belongs and adding the newly added resource file to that resource cluster, or generating a new resource cluster for the newly added resource file.
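The merge condition for neighbors spread over several clusters can be sketched as follows. This is only one reading of the text: it assumes Euclidean distance, treats "distances between the neighbor files" as the mean pairwise distance, and returns 0.0 for a cluster with fewer than two neighbors. The names `should_merge`, `ratio_gap`, and `distance_gap` are hypothetical stand-ins for the first and second preset values, and the fallback to the nearest centroid is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def should_merge(new_vec, neighbors, ratio_gap, distance_gap):
    """neighbors: list of (vector, cluster_id) pairs.
    Returns (second_cluster, third_cluster) when the preset condition holds,
    meaning the new file and the second cluster's neighbors move into the
    third cluster; returns None otherwise."""
    counts = Counter(cid for _, cid in neighbors)
    total = len(neighbors)
    for a, b in combinations(counts, 2):
        if abs(counts[a] - counts[b]) / total >= ratio_gap:
            continue
        # The third target cluster is the one with the higher proportion.
        second, third = (a, b) if counts[a] <= counts[b] else (b, a)
        in_second = [v for v, cid in neighbors if cid == second]
        in_third = [v for v, cid in neighbors if cid == third]
        avg1 = mean(euclidean(new_vec, v) for v in in_second)
        avg2 = mean(euclidean(new_vec, v) for v in in_third)
        avg3 = mean(euclidean(p, q) for p, q in combinations(in_second, 2))
        avg4 = mean(euclidean(p, q) for p, q in combinations(in_third, 2))
        if abs(avg1 - avg2) < distance_gap and avg1 < avg3 and avg2 < avg4:
            return second, third
    return None

# Hypothetical neighbors: two from cluster "A", three from cluster "B".
neighbors = [([-1.0, 0.0], "A"), ([1.0, 0.0], "A"),
             ([0.0, -1.0], "B"), ([0.0, 1.0], "B"), ([0.0, 1.2], "B")]
merge = should_merge([0.0, 0.0], neighbors, ratio_gap=0.3, distance_gap=0.5)
no_merge = should_merge([5.0, 0.0], neighbors, ratio_gap=0.3, distance_gap=0.5)
```

Here the first call returns `("A", "B")` because the new file sits between both groups of neighbors, and the second call returns `None` because the file is farther from the "A" neighbors than they are from each other.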
Further, each resource file has a corresponding file level and each resource cluster has a corresponding theme, and extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package, includes: acquiring request information from the resource file extraction request, wherein the request information includes at least one of the following: an extraction theme, an extraction file level, and an extraction quantity for each file type; in the case that the resource file extraction request includes an extraction theme and an extraction file level, screening out the resource files that conform to the extraction file level from the resource cluster whose theme is the same as the extraction theme; and, in the case that the resource file extraction request includes the extraction quantity, randomly extracting that quantity of resource files from the resource files that conform to the extraction file level to form a file package, and returning the file package.
Further, in the case that the resource file extraction request does not include the extraction quantity, the extraction quantity is determined according to the historical extraction behavior of the extraction subject, or resource files are extracted according to a preset extraction quantity; in the case that the resource file extraction request does not include the extraction theme, resource files are extracted from the top N resource clusters ranked by number of resource files from high to low; in the case that the resource file extraction request does not include the extraction file level, the extraction quantity of resource files is randomly extracted from the resource cluster whose theme is the same as the extraction theme to form a file package.
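The extraction rules above might be sketched as below. The data layout (clusters as dictionaries with `theme` and `files` keys, files with `name`, `type`, and `level`) and the parameter names `default_count` and `top_n` are assumptions made for illustration; deriving the quantity from historical behavior is not shown, and a preset quantity is used instead.

```python
import random

def extract_package(clusters, theme=None, level=None, counts=None,
                    default_count=2, top_n=1, rng=None):
    """Build a file package from clusters according to an extraction request."""
    rng = rng or random.Random()
    if theme is not None:
        chosen = [c for c in clusters if c["theme"] == theme]
    else:
        # No theme requested: fall back to the top N clusters by file count.
        chosen = sorted(clusters, key=lambda c: len(c["files"]), reverse=True)[:top_n]
    candidates = [f for c in chosen for f in c["files"]
                  if level is None or f["level"] == level]
    if counts is None:
        # No quantity requested: use a preset quantity for every type present.
        counts = {f["type"]: default_count for f in candidates}
    package = []
    for file_type, n in counts.items():
        pool = [f for f in candidates if f["type"] == file_type]
        package.extend(rng.sample(pool, min(n, len(pool))))
    return package

# Hypothetical resource clusters.
clusters = [
    {"theme": "triangles", "files": [
        {"name": "q1", "type": "question", "level": "easy"},
        {"name": "q2", "type": "question", "level": "hard"},
        {"name": "q3", "type": "question", "level": "easy"},
        {"name": "v1", "type": "video", "level": "easy"}]},
    {"theme": "circles", "files": [
        {"name": "c1", "type": "courseware", "level": "easy"}]},
]
package = extract_package(clusters, theme="triangles", level="easy",
                          counts={"question": 2, "video": 1},
                          rng=random.Random(0))
names = sorted(f["name"] for f in package)
```

With this request only the two easy questions and the one video in the "triangles" cluster qualify, so `names` contains `q1`, `q3`, and `v1` regardless of the random seed.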
According to another aspect of the embodiment of the present invention, there is also provided a processing apparatus for a file, including: the acquisition module is used for acquiring a plurality of resource files and constructing characteristic information of each resource file; the clustering module is used for clustering a plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters; and the composition module is used for extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
According to another aspect of embodiments of the present invention, there is also provided a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of the above.
According to another aspect of the embodiment of the present invention, there is also provided an intelligent interactive tablet, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of the above.
In the embodiment of the invention, the resource clusters are obtained by constructing the characteristic information of the resource files and clustering a plurality of resource files, and the combination of a plurality of resources can be obtained from the resource clusters to form a file package according to the resource file extraction request. The method for combining the resource files can be used for generating a lesson preparation package for teaching, can generate a multi-resource combination with similar content and suitable for matched use by constructing text features of the resource files in the education field and correspondingly clustering, helps a teacher to quickly construct the lesson preparation package meeting requirements, reduces the time for the teacher to find resources and match different types of resources, solves the technical problem that the teacher cannot accurately find proper teaching resources due to the recommending method of single teaching resources in the prior art, and improves teaching efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method of processing a file according to an embodiment of the invention;
FIG. 2 is a flow chart of an alternative method of processing a file according to an embodiment of the present invention;
FIG. 3 is a flow chart of an alternative method of building a profile of a resource file according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a document processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative intelligent interactive tablet according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiments of the invention, a combination of multiple types of resources is defined as a lesson preparation package. The resources of different types in a lesson preparation package are related to one another, so that the resource content is coherent. In the prior art, resources are usually associated through their own labels; for example, the questions, courseware, and videos under the same chapter are associated with one another. However, even within the same section, different resources may cover different content; for example, the examples used in different resources may be set in different scenarios and so be unsuitable for use together. The prior-art association method is therefore inaccurate, and a teacher cannot reliably obtain the expected lesson preparation package.
Example 1
According to an embodiment of the present invention, an embodiment of a file processing method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
FIG. 1 is a flow chart of a method of processing a file according to an embodiment of the present invention, as shown in FIG. 1, the method comprising the steps of:
Step S101, a plurality of resource files are acquired, and feature information of each resource file is constructed.
The plurality of resource files are resources of the same type or of different types that relate to the user's needs. In an alternative embodiment for generating a lesson preparation package, the plurality of resource files may include, but are not limited to, questions, courseware, videos, and other multimedia resource files. Specifically, the question information may include: question text content (stem, options, answers), difficulty, associated chapters, etc.; the courseware information may include: courseware content (containing the text of each page), associated chapters, etc.; and the video information may include: video content (including the subtitles or audio of each frame), associated chapters, and so forth.
In order to characterize different types of resources, features need to be extracted from them, including but not limited to vectorized text features. For example, in the embodiment of generating the lesson preparation package, questions, courseware, and videos are all rich in text, so text features are extracted from each of them; for video resources without subtitles, the audio is first converted into text and the text features are then extracted. The resulting text features are used to represent the different types of resources.
Step S102, clustering a plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters.
Resource clusters can be understood as combinations of resources that are close in content. Once the vectorized feature information of each resource file has been obtained, the vectors can be clustered to obtain combinations of resources with similar content. For example, in the embodiment of generating the lesson preparation package, if clustering is restricted to a single chapter, the resource files under the same chapter are clustered according to the text features of each resource file, and resources with similar content are grouped together to obtain a plurality of resource clusters, where each resource cluster includes resource files of different types such as questions, courseware, and videos.
Step S103, extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
A file package may be understood as the combination of resource files that a user expects to obtain. The resource file extraction request is input by the user and may include keywords describing the resource files, the number of resources of each type, and so on. The keywords in the request are matched against the keywords of the resource clusters, at least one matching resource cluster is returned, resource files are extracted from each returned resource cluster, and the resource files of different types extracted from the resource clusters are combined to form the file package. In the embodiment of generating the lesson preparation package, a teacher can supply keywords for the resource content, the number of resources of each type, and the question difficulty level as the content of the request; the most relevant resource clusters are returned by matching the teacher's keywords against the keywords of the resource clusters, the specified numbers of questions, courseware, and videos are randomly selected from each resource cluster, and the lesson preparation package is finally generated.
In an alternative embodiment of generating a lesson preparation package for teaching, resource information covering questions, courseware, and videos is first obtained, wherein the question information includes question text content (stem, options, answers), difficulty, and associated chapters; the courseware information includes courseware content (including the text of each page) and associated chapters; and the video information includes video content (including the subtitles or audio of each frame), associated chapters, and the like. Text features of the resource information are then constructed (that is, keywords of the questions, courseware, and videos are extracted) and vectorized to obtain text vectors. Once the text vector of each resource has been obtained, the vectors can be clustered so that resource files with similar content are gathered in the same resource cluster. For example, when the clustering scope is set to a section, the resources under the same section are selected by their section labels and clustered together. A teacher then inputs a resource file extraction request and a lesson preparation package is generated. For example, if the request asks for video files and questions related to a certain topic of a certain chapter, the corresponding numbers of video files and questions are selected from the resource cluster for that topic under that chapter to generate the lesson preparation package.
In the embodiment, the resource clusters are obtained by constructing the characteristic information of the resource files and clustering a plurality of resource files, and the combination of a plurality of resources can be obtained from the resource clusters according to the resource file extraction request to form the file package. The method for combining the resource files can be used for generating a lesson preparation package for teaching, can generate a multi-resource combination with similar content and suitable for matched use by constructing text features of the resource files in the education field and correspondingly clustering, helps a teacher to quickly construct the lesson preparation package meeting requirements, reduces the time for the teacher to find resources and match different types of resources, solves the technical problem that the teacher cannot accurately find proper teaching resources due to the recommending method of single teaching resources in the prior art, and improves teaching efficiency.
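As an illustration of the clustering step, the following is a minimal K-means sketch over text vectors. It assumes Euclidean distance and, purely for reproducibility, picks evenly spaced input vectors as initial centroids; the patent names K-means but does not specify initialization, and the function name `kmeans` and the sample vectors are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=20):
    """Return (labels, centroids) after a fixed number of K-means iterations."""
    # Assumed initialization: evenly spaced vectors from the input list.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical text vectors for six resource files in one chapter.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(vectors, k=2)
```

The first three files end up in one resource cluster and the last three in another. In practice each cluster would hold questions, courseware, and videos whose text vectors lie close together.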
As an alternative embodiment, acquiring a plurality of resource files and constructing feature information of each resource file includes: acquiring text information in a resource file and performing word segmentation on the text information; cleaning the word segmentation result using a stop-word list; and performing text vectorization based on the cleaning result to obtain a text vector representing the feature information.
Text features can be obtained through a bag-of-words model or a word embedding model, and both methods construct text features at word granularity, so the text needs to be segmented and cleaned first. Word segmentation can be understood as splitting text information into meaningful word units, and cleaning can be understood as filtering stop words out of the resulting word units. For example, segmenting the phrase "the side length of an equilateral triangle" yields three word units: "equilateral triangle", "side length", and a function word with no practical meaning (that is, a stop word). After the segmentation result is cleaned, the two word units "equilateral triangle" and "side length" are retained and then vectorized.
In an alternative embodiment, TF-IDF (term frequency-inverse document frequency) or Word2vec (word to vector, a word vector model) may be used to vectorize the cleaned text.
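A minimal TF-IDF sketch is given below to make the vectorization step concrete. It assumes the input has already been segmented and cleaned, and it uses one common variant of the weighting (term frequency times the logarithm of N over document frequency); the patent does not state which variant is used, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of word lists. Returns (vocabulary, list of vectors)."""
    vocabulary = sorted({w for doc in documents for w in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(w for doc in documents for w in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([(counts[w] / total) * math.log(n_docs / doc_freq[w])
                        for w in vocabulary])
    return vocabulary, vectors

# Two hypothetical cleaned documents.
docs = [["equilateral triangle", "side length"],
        ["equilateral triangle", "area"]]
vocabulary, vectors = tfidf_vectors(docs)
```

Because "equilateral triangle" appears in both documents, its weight is zero in both vectors, while "side length" and "area" each receive a positive weight in the document that contains them.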
As an optional embodiment, in the case that the resource file is a video file, obtaining text information in the resource file includes: acquiring caption data to obtain text information in the video file under the condition that the video file comprises the caption data; in the case where the video file does not include subtitle data, voice information in the video file is extracted and converted into text information.
It should be noted that, when the feature information of the resource file is represented by the text vector, it is necessary to convert the different kinds of resources into the text vector. For video resources without subtitles, the voice information in the video resources is converted into characters, the vectorized text features are extracted, and the acquired text features are used for representing different types of resources.
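The subtitle-or-speech fallback can be sketched in a few lines. The dictionary layout of `video` and the `transcribe` callable are assumptions for illustration; a real system would plug in an actual speech recognition service, which the patent does not name.

```python
def video_text(video, transcribe):
    """Return text for a video: subtitles if present, else transcribed audio."""
    subtitles = video.get("subtitles")
    if subtitles:
        return " ".join(subtitles)
    # No subtitle data: convert the voice information into text.
    return transcribe(video["audio"])

def fake_transcribe(audio):
    """Stand-in for a speech-to-text service (hypothetical)."""
    return "transcript of " + audio

with_subs = video_text({"subtitles": ["two lines", "intersect"], "audio": "a.wav"},
                       fake_transcribe)
without_subs = video_text({"audio": "b.wav"}, fake_transcribe)
```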
As an alternative embodiment, the method further comprises: creating a stop word list corresponding to the file type of the resource file, wherein creating the stop word list corresponding to the file type of the resource file comprises: performing word segmentation on all resource files in the resource library, wherein the resource library comprises a plurality of types of resource files; screening out the stop words corresponding to each type of resource file from the word segmentation results of all resource files, wherein the stop words corresponding to each type of resource file are determined according to the occurrence frequency of each word in that type of resource file; and generating the stop word list corresponding to each file type according to the stop words corresponding to that type of resource file.
It should be noted that, in the education field, different types of resource files (for example, questions, courseware, and videos) often contain generic text with little information, such as "the following description" in questions, "the goal of the lesson" in courseware, and "why" in videos. General stop word lists cannot cover these words, so a dedicated stop word list needs to be built for resources in the education field. Because questions, courseware, and videos differ in their text style, a stop word list needs to be constructed separately for each type of resource file.
Taking courseware as an example, the text of all courseware in the resource library is extracted, segmented, and cleaned with a general stop word list. The frequency of each word is then counted, and words with higher frequency are screened out as new stop words to build a dedicated stop word list for the educational resource field. The same construction method can be used for the stop word lists of the other resource file types, such as questions and videos.
As an alternative embodiment, cleaning the word segmentation result through the stop word list includes: cleaning the word segmentation result through the stop word list corresponding to the file type of the resource file.
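A minimal sketch of the frequency-based construction described above. The `top_n` cut-off is an assumption: the source only says that higher-frequency words are screened, without fixing a threshold, and a real pipeline would likely review the candidates so that frequent content words are not discarded.

```python
from collections import Counter

def build_stop_words(segmented_docs, general_stop_words, top_n):
    """Extend a general stop word list with the top_n most frequent words."""
    counts = Counter(
        w for doc in segmented_docs for w in doc if w not in general_stop_words
    )
    new_words = {w for w, _ in counts.most_common(top_n)}
    return set(general_stop_words) | new_words

general = {"of", "the"}
# Hypothetical segmented texts, grouped by resource file type.
by_type = {
    "courseware": [
        ["goal", "of", "lesson", "intersecting", "lines"],
        ["goal", "of", "lesson", "parallel", "lines"],
        ["goal", "of", "lesson", "angles"],
    ],
    "video": [
        ["why", "the", "angles", "match"],
        ["why", "the", "lines", "cross"],
    ],
}
# One stop word list per file type, as the text requires.
stop_lists = {
    file_type: build_stop_words(docs, general, top_n=2 if file_type == "courseware" else 1)
    for file_type, docs in by_type.items()
}
```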
As an alternative embodiment, after performing text vectorization processing based on the preprocessing result to obtain a text vector for representing the feature information, the method further includes one or more of the following: performing scaling processing on the text vector through an activation function; and performing dimension reduction processing on the text vector.
Nouns related to knowledge points tend to recur in courseware and videos. For example, in courseware that teaches intersecting lines, the term "intersecting lines" appears repeatedly on different pages, so its count in the courseware is large. If text features are extracted directly, this term dominates, and other words receive low weights and have little influence. Therefore, after the text vector is extracted, element-wise scaling is applied with an activation function so that each dimension of the text vector lies between 0 and 1. For example, the activation function may be the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

where x is a text vector and the function is applied to each of its elements.
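The element-wise scaling can be sketched as below. The two-branch form of the sigmoid is an implementation choice made here to avoid floating-point overflow for large negative inputs; it computes the same function as the standard sigmoid.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid of a single value."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_scale(vector):
    """Apply the sigmoid to every dimension of a text vector."""
    return [sigmoid(x) for x in vector]

# A dominant term (30.0) no longer dwarfs the other dimensions after scaling.
scaled = sigmoid_scale([0.0, 2.0, 30.0])
```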
In addition, the directly extracted text vectors are high-dimensional and sparse, so the text vectors of the different resource types (for example, questions, courseware, and videos) can be reduced in dimension in a uniform way. The dimension reduction method may be PCA (principal component analysis), Isomap (isometric feature mapping), t-SNE (t-distributed stochastic neighbor embedding), or the like.
As an optional embodiment, clustering the plurality of resource files based on the characteristic information of each resource file, generating a plurality of resource clusters includes: clustering a plurality of resource files based on the characteristic information of each resource file through a K-means clustering algorithm to generate a plurality of resource clusters.
The K-means clustering algorithm divides samples into clusters according to the distances between them and improves the clustering by iteratively updating the cluster centroids. By clustering on the vector features of the resource files, resources with similar content are grouped together to form a plurality of resource clusters.
As an optional embodiment, after clustering the plurality of resource files based on the feature information of each resource file to generate a plurality of resource clusters, the method further includes: receiving a newly added resource file and constructing the feature information of the newly added resource file; determining the neighbor files of the newly added resource file according to the feature information of the newly added resource file and the feature information of the existing resource files; and dividing the newly added resource into any one of the plurality of resource clusters according to the distance relationship between the newly added resource file and the neighbor files, or generating a new resource cluster for the newly added resource.
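For reference, a plain K-means loop might look like the sketch below. It is a generic textbook version, not code from the source; the random initialization, the iteration cap, and the toy two-dimensional points are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, iterations=50, seed=0):
    """Return (labels, centroids) after alternating assignment and update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = mean(members)
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return labels, centroids

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
```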
For a database in the education field, a large number of new resources (such as newly uploaded questions, courseware, and videos) are added to the repository every day. With an incremental clustering method, the newly added resources can be placed into a suitable resource cluster on the basis of the original clusters, where the original clusters may be the resource clusters obtained through the K-means algorithm or a newly created cluster. Classifying new resources by incremental clustering improves the efficiency of resource clustering and saves processing time.
As an optional embodiment, in the case that the neighbor files all belong to the same first target resource cluster, dividing the newly added resource into any one of the plurality of resource clusters according to the distance relationship between the newly added resource file and the neighbor files, or generating a new resource cluster for the newly added resource, includes: acquiring a first distance between the newly added resource file and the centroid of the first target resource cluster; acquiring a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid; acquiring the average distance between all resource files in the first target resource cluster and the centroid; dividing the newly added resource into the first target resource cluster in the case that the difference between the first distance and the second distance is smaller than or equal to the average distance; and generating a new resource cluster for the newly added resource in the case that the difference between the first distance and the second distance is larger than the average distance.
Specifically, a text vector is extracted for each newly stored resource, the distances between the new resource and all clustered resources are calculated, and the k nearest neighbors of the new resource are obtained. If all k neighbors belong to the first target resource cluster (that is, to the same cluster), three quantities are calculated: the distance dist_c between the new resource and the centroid of the cluster (the first distance), the average distance dist_mean between all samples of the cluster and the centroid, and the distance dist_max between the centroid and the resource file in the cluster farthest from it (the second distance). Whether the new resource is divided into the first target resource cluster or into a new cluster is then determined by the following two conditions:
a) If dist_c - dist_max <= dist_mean, that is, the new resource is not too far from the centroid, the new resource is divided into the first target resource cluster;
b) If dist_c - dist_max > dist_mean, that is, the new resource is too far from the centroid, a new cluster is created for the new resource alone.
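Conditions a) and b) translate directly into a small decision function. In this sketch the Euclidean metric, the function names, and the toy cluster are assumptions; the source does not state which distance measure is used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(members):
    """Component-wise mean of the cluster members."""
    return [sum(col) / len(members) for col in zip(*members)]

def joins_cluster(new_vec, members):
    """True: join the cluster (condition a). False: create a new one (b)."""
    c = centroid(members)
    dists = [euclidean(m, c) for m in members]
    dist_c = euclidean(new_vec, c)        # first distance
    dist_max = max(dists)                 # second distance
    dist_mean = sum(dists) / len(dists)   # average member-to-centroid distance
    return dist_c - dist_max <= dist_mean

cluster = [[0, 0], [2, 0], [0, 2], [2, 2]]
near = joins_cluster([1.5, 1.5], cluster)
far = joins_cluster([10, 10], cluster)
```

Because dist_max is subtracted first, a new resource can lie somewhat beyond the current cluster boundary and still be accepted, as long as the excess does not exceed dist_mean.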
As an optional embodiment, in a case that the neighbor files do not all belong to the same resource cluster, dividing the newly added resource into any one of the plurality of resource clusters according to the distance relationship between the newly added resource file and the neighbor files, or generating a new resource cluster for the newly added resource, includes: acquiring the proportion of neighbor files belonging to each resource cluster; in the case that there exist a second target resource cluster and a third target resource cluster whose proportions differ by less than a first preset value, acquiring a first average value of the distances between the newly added resource and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files in the second target resource cluster, and a fourth average value of the distances between the neighbor files in the third target resource cluster; if the first, second, third, and fourth average values meet a preset condition, adding the newly added resource file and the neighbor files in the second target resource cluster to the third target resource cluster, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset condition includes: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value; if the first, second, third, and fourth average values do not meet the preset condition, acquiring the resource cluster to which the centroid with the shortest distance to the newly added resource file belongs, and either adding the newly added resource file to that resource cluster or generating a new resource cluster for the newly added resource.
Specifically, if the k neighbor files do not all belong to the same cluster, the proportion of neighbor files in each resource cluster to which they belong is calculated. If the neighbor files belong to m resource clusters, these proportions are denoted $f_1, f_2, \ldots, f_m$, where $f_1 + f_2 + \cdots + f_m = 1$, and the resource cluster that the new resource joins is determined according to the following conditions:
If there exist $i \neq j$ such that $|f_i - f_j| < f$, where $f_i$ and $f_j$ are the proportions of neighbor files in the second target resource cluster i and the third target resource cluster j respectively, and f is the first preset value (a threshold on the difference between the two proportions, below which the proportions of neighbor files in resource cluster i and resource cluster j are considered comparable), then the following are calculated: the average distance d_i_in between the new resource and the neighbor files in resource cluster i (the first average value), the average distance d_j_in between the new resource and the neighbor files in resource cluster j (the second average value), the average distance d_i_mean between the neighbor files contained in resource cluster i (the third average value), and the average distance d_j_mean between the neighbor files contained in resource cluster j (the fourth average value).
a) If the preset condition |d_i_in - d_j_in| < d, d_i_in < d_i_mean, and d_j_in < d_j_mean is met, where d is the second preset value (a threshold on the difference between d_j_in and d_i_in), then the distance between the new resource and the neighbor files of resource cluster i and the distance between the new resource and the neighbor files of resource cluster j are comparable and both small. In this case resource cluster i and resource cluster j are merged, and the new resource is divided into the merged cluster. As an alternative embodiment, the neighbor files in the resource cluster with the lower proportion may be added, together with the new resource, to the resource cluster with the higher proportion.
b) If the preset condition is not met, the distance between the new resource and the centroid of each resource cluster containing a neighbor file is calculated, and the resource cluster with the shortest centroid distance is taken as the candidate cluster. Whether the new resource joins the candidate cluster or a newly created resource cluster is then decided by the rule used when the neighbor files all belong to the same first target resource cluster (the candidate cluster is treated as the first target resource cluster).
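The quantities used in the mixed-neighbor case can be sketched as follows. Several details are assumptions not fixed by the source: the Euclidean metric, the reading of the third and fourth average values as mean pairwise distances among a cluster's neighbor files, the convention that a single neighbor has a mean pairwise distance of 0.0, and all function names.

```python
import math
from collections import Counter
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(new_vec, vectors, labels, k):
    """Return the k closest (vector, cluster label) pairs to new_vec."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(new_vec, vectors[i]))
    return [(vectors[i], labels[i]) for i in order[:k]]

def proportions(neighbors):
    """Share of the neighbor files that falls in each cluster (f_1..f_m)."""
    counts = Counter(label for _, label in neighbors)
    return {label: n / len(neighbors) for label, n in counts.items()}

def mean_pairwise(vectors):
    """Average distance between neighbor files of one cluster (0.0 if < 2)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def meets_preset_condition(new_vec, neighbors, i, j, d):
    """Check |d_i_in - d_j_in| < d, d_i_in < d_i_mean, d_j_in < d_j_mean.

    Assumes clusters i and j each hold at least one of the neighbors.
    """
    in_i = [v for v, lab in neighbors if lab == i]
    in_j = [v for v, lab in neighbors if lab == j]
    d_i_in = sum(euclidean(new_vec, v) for v in in_i) / len(in_i)
    d_j_in = sum(euclidean(new_vec, v) for v in in_j) / len(in_j)
    return (
        abs(d_i_in - d_j_in) < d
        and d_i_in < mean_pairwise(in_i)
        and d_j_in < mean_pairwise(in_j)
    )

vectors = [[-1, 1], [-1, -1], [1, 1], [1, -1], [10, 10]]
labels = ["A", "A", "B", "B", "C"]
neighbors = k_nearest([0, 0], vectors, labels, k=4)
shares = proportions(neighbors)
merge = meets_preset_condition([0, 0], neighbors, "A", "B", d=0.5)
```

Here the new resource sits between two equally represented clusters and is closer to each group of neighbors than those neighbors are to one another, so the condition in a) holds.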
As an alternative embodiment, each resource file has a corresponding file level and each resource cluster has a corresponding theme, and extracting resource files from at least one resource cluster to form a file package according to a received resource file extraction request, and returning the file package, includes: extracting the request information from the resource file extraction request, wherein the request information includes at least one of the following: an extraction theme, an extraction file level, and an extraction number corresponding to each file type; in the case that the resource file extraction request includes an extraction theme and an extraction file level, screening out the resource files that conform to the extraction file level from the resource cluster whose theme is the same as the extraction theme; and in the case that the resource file extraction request includes the extraction number, randomly extracting that number of resource files from the resource files that conform to the extraction file level to form a file package, and returning the file package.
It should be noted that each resource cluster contains resources of different types that were clustered together according to content similarity. When generating the file package, the resource files can therefore be selected preferentially from a single resource cluster, which keeps the different resource files in the package related to one another.
In the embodiment of generating a lesson preparation package for a teacher in the education field, the extraction subject may be keywords of questions, courseware, and videos, and the extraction file level may be the question difficulty. For example, a teacher inputs request information for generating a lesson preparation package, including the keywords of the questions, courseware, and videos, the number of files of each type, and the question difficulty level. A plurality of related resource clusters are returned according to the keywords and the number of files. For each resource cluster, the questions that do not meet the difficulty requirement are filtered out, and the specified number of resource files are then randomly selected from the cluster to form the lesson preparation package.
As an alternative embodiment, in the case that the extraction number is not included in the resource file extraction request, determining the extraction number according to the historical extraction behavior of the extraction subject, or extracting the resource file according to a preset extraction number; extracting the resource files from the first N resource clusters with the number of the resource files ordered from high to low under the condition that the extraction subject is not included in the resource file extraction request; in the case where the extraction file rank is not included in the resource file extraction request, the resource files corresponding to the extraction number are randomly extracted from the same resource cluster as the extraction subject to form a file package.
It should be noted that the request information may include only one or two of the extraction subject, the extraction file level, and the extraction number corresponding to each file type. For example, when a teacher inputs a request for generating a lesson preparation package and provides only the keywords and the question difficulty level, the number of resource files in the package may be determined from the numbers the teacher requested in past lesson preparation package requests, or from a preset extraction number. In an alternative embodiment, the keywords input by the teacher are matched against the keywords of the resource clusters and the m resource clusters with the most relevant content are returned, where m is the preset number of returned resource clusters. If the teacher does not input keywords, the resource clusters can be sorted by popularity and the top m resource clusters returned. For each resource cluster, the questions that do not meet the difficulty requirement are filtered out, and a specified number of questions, courseware, and videos are randomly selected to generate a lesson preparation package.
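Package generation from a single resource cluster can be sketched as below. The `Resource` record, its field names, and the convention that only questions carry a difficulty level are assumptions made for illustration; keyword matching against cluster themes is left out.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Resource:
    name: str
    file_type: str               # "question", "courseware", or "video"
    level: Optional[int] = None  # difficulty; assumed set only for questions

def build_package(cluster: List[Resource], counts: Dict[str, int],
                  level: Optional[int] = None, seed: Optional[int] = None):
    """Filter by level, then randomly draw the requested number per type."""
    rng = random.Random(seed)
    package = []
    for file_type, n in counts.items():
        candidates = [
            r for r in cluster
            if r.file_type == file_type
            and (level is None or r.level is None or r.level == level)
        ]
        # Never ask for more files than the cluster can supply.
        package.extend(rng.sample(candidates, min(n, len(candidates))))
    return package

cluster = [
    Resource("q1", "question", 1),
    Resource("q2", "question", 2),
    Resource("q3", "question", 2),
    Resource("c1", "courseware"),
    Resource("v1", "video"),
]
package = build_package(
    cluster, {"question": 1, "courseware": 1, "video": 1}, level=2, seed=0
)
```

When the request omits the extraction number, the `counts` argument could be filled from the requester's history or from preset defaults, as the text describes.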
FIG. 2 is a flowchart of an alternative file processing method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
Step S201, obtaining resource information; the resources processed include, but are not limited to, questions, courseware, and videos. The required question information includes: question text content (stem, options, answers), difficulty, and associated chapters. The required courseware information includes: courseware content (containing the text of each page) and associated chapters. The required video information includes: video content (including subtitles or audio for each frame) and associated chapters.
Step S202, constructing resource features; the features of each resource, in particular its text features, are vectorized.
Step S203, clustering resources; after the vector representation of each resource is obtained, the vectors can be used for clustering. For example, when the clustering scope is set to a section, the resources under the same section, identified by their section labels, are clustered so that resources with similar content are grouped together.
Step S204, generating a lesson preparation package.
FIG. 3 is a flowchart of an alternative method for constructing feature information of a resource file according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
Step S301, updating the stop word list; taking courseware as an example, the text of all courseware in the resource library is extracted, segmented, and cleaned with a general stop word list. The frequency of each word is then counted, and new stop words are screened from the higher-frequency words to expand the stop word list for the education resource field.
Step S302, word segmentation and stop word removal; each resource file is segmented and its stop words are removed.
Step S303, text vectorization; the segmented words remaining after stop word removal are vectorized using TF-IDF or Word2vec.
Step S304, vector compression; element-wise scaling is performed with an activation function so that each dimension of the text vector lies between 0 and 1.
Through the above steps, a stop word list applicable to questions, courseware, and videos in the education field is constructed, and features of the various resources are extracted in the same space so that resources of different types can be represented uniformly. In this embodiment, clustering is used to find strongly related resources, so that combinations of resources that are similar in content and suitable for use together are found more efficiently. For the large number of new resources generated every day, the incremental clustering method assigns each new resource to a suitable resource cluster. A teacher can thus quickly build a lesson preparation package that meets the requirements, the time spent finding resources and matching resources of different types is reduced, and teaching efficiency is improved.
Example 2
According to an embodiment of the present application, an embodiment of a file processing device is provided. FIG. 4 is a schematic diagram of a file processing device according to an embodiment of the present invention. As shown in FIG. 4, the device includes: an obtaining module 41, configured to obtain a plurality of resource files and construct feature information of each resource file; a clustering module 42, configured to cluster the plurality of resource files based on the feature information of each resource file and generate a plurality of resource clusters; and a construction module 43, configured to extract resource files from at least one resource cluster to form a file package according to a received resource file extraction request, and return the file package.
As an alternative embodiment, the obtaining module includes: a first word segmentation sub-module, configured to acquire the text information in the resource file and segment the text information; a cleaning sub-module, configured to clean the word segmentation result through the stop word list; and a vectorization sub-module, configured to perform text vectorization based on the cleaning result to obtain a text vector representing the feature information.
As an optional embodiment, in a case where the resource file is a video file, the obtaining module includes: a subtitle extraction sub-module, configured to acquire subtitle data to obtain the text information in the video file when the video file includes subtitle data; and a voice conversion sub-module, configured to extract the voice information in the video file and convert it into text information when the video file does not include subtitle data.
As an alternative embodiment, the above device further comprises: a stop word list creation sub-module, configured to create a stop word list corresponding to the file type of the resource file, wherein the stop word list creation sub-module includes: a second word segmentation sub-module, configured to segment all resource files in the resource library, wherein the resource library comprises a plurality of types of resource files; a screening sub-module, configured to screen out the stop words corresponding to each type of resource file from the word segmentation results of all resource files, wherein the stop words corresponding to each type of resource file are determined according to the occurrence frequency of each word in that type of resource file; and a stop word list generation sub-module, configured to generate the stop word list corresponding to each file type according to the stop words corresponding to that type of resource file.
As an optional embodiment, the above cleaning sub-module is further configured to clean the word segmentation result through the stop word list corresponding to the file type of the resource file.
As an alternative embodiment, the apparatus further comprises one or more of the following: the scaling sub-module is used for scaling the text vector through the activation function; and the dimension reduction sub-module is used for carrying out dimension reduction processing on the text vector.
As an optional embodiment, the clustering module is further configured to cluster, by using a K-means clustering algorithm, a plurality of resource files based on the feature information of each resource file, so as to generate a plurality of resource clusters.
As an alternative embodiment, the apparatus further includes: a first newly added sub-module, configured to receive a newly added resource file and construct the feature information of the newly added resource file; a neighbor determination sub-module, configured to determine the neighbor files of the newly added resource file according to the feature information of the newly added resource file and the feature information of the existing resource files; and a first dividing sub-module, configured to divide the newly added resource into any one of the plurality of resource clusters according to the distance relationship between the newly added resource file and the neighbor files, or to generate a new resource cluster for the newly added resource.
As an optional embodiment, in a case where the neighbor files all belong to the same first target resource cluster, the first dividing sub-module further includes: a first distance acquisition sub-module, configured to acquire a first distance between the newly added resource file and the centroid of the first target resource cluster; a second distance acquisition sub-module, configured to acquire a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid; an average distance acquisition sub-module, configured to acquire the average distance between all resource files in the first target resource cluster and the centroid; a second dividing sub-module, configured to divide the newly added resource into the first target resource cluster in the case that the difference between the first distance and the second distance is smaller than or equal to the average distance; and a second newly added sub-module, configured to generate a new resource cluster for the newly added resource in the case that the difference between the first distance and the second distance is larger than the average distance.
As an optional embodiment, in a case where the neighbor files do not all belong to the same resource cluster, the first dividing sub-module further includes: a proportion acquisition sub-module, configured to acquire the proportion of neighbor files belonging to each resource cluster; an average value acquisition sub-module, configured to acquire, in the case that there exist a second target resource cluster and a third target resource cluster whose proportions differ by less than a first preset value, a first average value of the distances between the newly added resource and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files in the second target resource cluster, and a fourth average value of the distances between the neighbor files in the third target resource cluster; a first adding sub-module, configured to add the newly added resource file and the neighbor files in the second target resource cluster to the third target resource cluster if the first, second, third, and fourth average values meet a preset condition, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset condition includes: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value; and a second adding sub-module, configured to, if the first, second, third, and fourth average values do not meet the preset condition, acquire the resource cluster to which the centroid with the shortest distance to the newly added resource file belongs, and either add the newly added resource file to that resource cluster or generate a new resource cluster for the newly added resource.
As an alternative embodiment, each resource file has a corresponding file level and each resource cluster has a corresponding theme, and the above construction module includes: an extraction sub-module, configured to extract the request information from the resource file extraction request, wherein the request information includes at least one of the following: an extraction theme, an extraction file level, and an extraction number corresponding to each file type; a first selection sub-module, configured to screen out the resource files that conform to the extraction file level from the resource cluster whose theme is the same as the extraction theme, in the case that the resource file extraction request includes the extraction theme and the extraction file level; and a second selection sub-module, configured to randomly extract, in the case that the resource file extraction request includes the extraction number, that number of resource files from the resource files that conform to the extraction file level to form a file package, and to return the file package.
As an alternative embodiment, the above construction module further includes: a third selection sub-module, configured to determine the extraction number according to the historical extraction behavior of the extraction subject, or to extract resource files according to a preset extraction number, in the case that the extraction number is not included in the resource file extraction request; and a fourth selection sub-module, configured to extract resource files from the first N resource clusters when the resource clusters are sorted by number of resource files from high to low, in the case that the resource file extraction request does not include an extraction subject; in the case that the extraction file level is not included in the resource file extraction request, the resource files corresponding to the extraction number are randomly extracted from the same resource cluster as the extraction subject to form a file package.
In the embodiment, the resource clusters are obtained by constructing the characteristic information of the resource files and clustering a plurality of resource files, and the combination of a plurality of resources can be obtained from the resource clusters according to the resource file extraction request to form the file package. The method for combining the resource files can be used for generating a lesson preparation package for teaching, can generate a multi-resource combination with similar content and suitable for matched use by constructing text features of the resource files in the education field and correspondingly clustering, helps a teacher to quickly construct the lesson preparation package meeting requirements, reduces the time for the teacher to find resources and match different types of resources, solves the technical problem that the teacher cannot accurately find proper teaching resources due to the recommending method of single teaching resources in the prior art, and improves teaching efficiency.
Example 3
According to an embodiment of the present application, there is provided an embodiment of a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of the above. The method comprises the steps of constructing characteristic information of resource files, clustering a plurality of resource files to obtain resource clusters, and obtaining a plurality of resource combinations from the resource clusters to form a file package according to a resource file extraction request. The method for combining the resource files can be used for generating a lesson preparation package for teaching, can generate a multi-resource combination with similar content and suitable for matched use by constructing text features of the resource files in the education field and correspondingly clustering, helps a teacher to quickly construct the lesson preparation package meeting requirements, reduces the time for the teacher to find resources and match different types of resources, solves the technical problem that the teacher cannot accurately find proper teaching resources due to the recommending method of single teaching resources in the prior art, and improves teaching efficiency.
Example 4
According to an embodiment of the present application, an intelligent interactive tablet is provided, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor to perform any of the method steps of embodiment 1.
Fig. 5 is a schematic structural diagram of an intelligent interactive tablet provided in an embodiment of the present application. The intelligent interactive tablet includes an interactive device main body and a touch frame. With reference to Fig. 5, the intelligent interactive tablet 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.
The communication bus 1002 enables connection and communication among these components.
The user interface 1003 may include a display and a camera; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface).
The processor 1001 may include one or more processing cores. The processor 1001 connects the various parts of the intelligent interactive tablet 1000 through various interfaces and lines, and performs the functions of the intelligent interactive tablet 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 1005 and by invoking the data stored in the memory 1005. Optionally, the processor 1001 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 1001 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU renders and draws the content to be displayed on the display screen; the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 1001 and may instead be implemented by a separate chip.
The memory 1005 may include a random access memory (RAM) or a read-only memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable storage medium. The memory 1005 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1005 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the above method embodiments, and the like; the data storage area may store the data referred to in the above method embodiments. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in Fig. 5, the memory 1005, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and an operating application of the intelligent interactive tablet.
In the intelligent interactive tablet 1000 shown in fig. 5, the user interface 1003 is mainly used for providing an input interface for a user, and acquiring data input by the user; and the processor 1001 may be used to invoke the intelligent interactive tablet operating application program stored in the memory 1005 and specifically perform any of the operations of embodiment 1.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, the software product including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (13)

1. A file processing method, comprising:
acquiring a plurality of resource files and constructing characteristic information of each resource file;
clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters;
extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package;
after clustering the plurality of resource files based on the characteristic information of each resource file to generate the plurality of resource clusters, receiving a newly added resource file and constructing characteristic information of the newly added resource file; determining neighbor files of the newly added resource file according to the characteristic information of the newly added resource file and the characteristic information of the existing resource files; and, according to the distance relationship between the newly added resource file and the neighbor files, assigning the newly added resource file to one of the plurality of resource clusters or generating a new resource cluster for the newly added resource file;
wherein, in the case where the neighbor files do not all belong to the same resource cluster, assigning the newly added resource file to one of the plurality of resource clusters or generating a new resource cluster for the newly added resource file according to the distance relationship between the newly added resource file and the neighbor files comprises:
acquiring, for each resource cluster to which the neighbor files belong, the proportion of the neighbor files that belong to that resource cluster;
when there exist a second target resource cluster and a third target resource cluster whose proportions differ by less than a first preset value, acquiring a first average value of the distances between the newly added resource file and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource file and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files within the second target resource cluster, and a fourth average value of the distances between the neighbor files within the third target resource cluster;
if the first average value, the second average value, the third average value and the fourth average value meet a preset condition, adding the newly added resource file and the neighbor files in the second target resource cluster to the third target resource cluster, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset condition comprises: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value;
and if the first average value, the second average value, the third average value and the fourth average value do not meet the preset condition, acquiring the resource cluster to which the centroid closest to the newly added resource file belongs and adding the newly added resource file to that resource cluster, or generating a new resource cluster for the newly added resource file.
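The mixed-neighbor branch of claim 1 can be illustrated with the following minimal sketch. It is not the patented implementation: the function name, the Euclidean distance, the default values of the two preset values, the choice to compare only the two clusters with the largest neighbor shares, and the treatment of a single-neighbor cluster as having an internal average distance of 0.0 are all assumptions made for illustration. The fallback always takes the nearest-centroid option; generating a new cluster is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def mean(values):
    """Arithmetic mean; 0.0 for an empty sequence (assumption for one-member groups)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def place_with_mixed_neighbors(new_vec, neighbors, centroids,
                               first_preset=0.3, second_preset=0.5):
    """neighbors: list of (vector, cluster_id) spanning at least two clusters.
    centroids: {cluster_id: centroid vector}. Returns (action, cluster_id)."""
    counts = Counter(cluster_id for _, cluster_id in neighbors)
    # "third" is the cluster with the larger neighbor share, "second" the runner-up.
    (third, third_n), (second, second_n) = counts.most_common(2)
    if (third_n - second_n) / len(neighbors) < first_preset:
        in_second = [v for v, cid in neighbors if cid == second]
        in_third = [v for v, cid in neighbors if cid == third]
        first_avg = mean(math.dist(new_vec, v) for v in in_second)
        second_avg = mean(math.dist(new_vec, v) for v in in_third)
        third_avg = mean(math.dist(a, b) for a, b in combinations(in_second, 2))
        fourth_avg = mean(math.dist(a, b) for a, b in combinations(in_third, 2))
        if (abs(first_avg - second_avg) < second_preset
                and first_avg < third_avg and second_avg < fourth_avg):
            # New file plus the second cluster's neighbors move to the third cluster.
            return ("merge_into", third)
    # Preset condition not met: fall back to the nearest centroid.
    nearest = min(centroids, key=lambda cid: math.dist(new_vec, centroids[cid]))
    return ("join_nearest", nearest)


centroids = {"A": [5.0, 0.05], "B": [-4.0, 0.1]}
between = place_with_mixed_neighbors(
    [0.0, 0.0],
    [([1.0, 0.0], "A"), ([-1.0, 0.0], "A"),
     ([0.0, 1.0], "B"), ([0.0, -1.0], "B"), ([0.0, 1.2], "B")],
    centroids)
apart = place_with_mixed_neighbors(
    [0.0, 0.0],
    [([5.0, 0.0], "A"), ([5.0, 0.1], "A"),
     ([-5.0, 0.0], "B"), ([-5.0, 0.1], "B"), ([-5.0, 0.2], "B")],
    centroids)
```

In the first call the new file sits between two interleaved groups of neighbors, so the merge condition holds; in the second the two groups are tight and far apart, so the sketch falls back to the nearest centroid.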
2. The method of claim 1, wherein obtaining a plurality of resource files and constructing the characteristic information of each of the resource files comprises:
acquiring text information in the resource file, and segmenting the text information;
cleaning the word segmentation result using a stop word list;
and performing text vectorization processing based on the cleaning result to obtain a text vector used for representing the characteristic information.
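As a hedged illustration of the feature construction steps of claim 2, the sketch below segments text, removes stop words, and builds a vector per file. The claim does not name a segmenter or a vectorization scheme; the regular-expression tokenizer, the TF-IDF weighting, the stop word set, and all function names here are assumptions chosen only to make the steps concrete.

```python
import math
import re
from collections import Counter


def segment(text):
    """Split text into lowercase tokens (a stand-in for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def clean(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def tfidf_vectors(docs_tokens):
    """Return (vocabulary, vectors) using term frequency times a smoothed IDF."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    n_docs = len(docs_tokens)
    doc_freq = Counter(t for tokens in docs_tokens for t in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append([
            (counts[t] / total) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t in vocab
        ])
    return vocab, vectors


STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop word list
texts = ["The area of a triangle", "The area of a circle and the radius"]
cleaned = [clean(segment(t), STOP_WORDS) for t in texts]
vocab, vectors = tfidf_vectors(cleaned)
```

Each entry of `vectors` is the text vector of one resource file over the shared vocabulary; terms that occur in fewer files receive a larger weight.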
3. The method according to claim 2, wherein, in the case where the resource file is a video file, obtaining text information in the resource file includes:
in the case where the video file includes subtitle data, acquiring the subtitle data to obtain the text information in the video file;
and extracting voice information in the video file and converting the voice information into text information when the video file does not include the subtitle data.
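The two cases of claim 3 amount to a simple fallback, sketched below under stated assumptions: the video is modeled as a dictionary with hypothetical `subtitles` and `audio` keys, and speech recognition is passed in as a callable because the claim does not specify a recognizer.

```python
def extract_video_text(video, transcribe):
    """Prefer subtitle data; otherwise convert the audio track to text."""
    subtitles = video.get("subtitles")
    if subtitles:
        return " ".join(subtitles)
    return transcribe(video["audio"])


def fake_transcribe(audio):
    """Placeholder recognizer used only for this example."""
    return "text recognized from " + audio


with_subs = extract_video_text(
    {"subtitles": ["Today we study", "fractions"], "audio": "a.wav"}, fake_transcribe)
without_subs = extract_video_text({"audio": "b.wav"}, fake_transcribe)
```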
4. The method according to claim 2, wherein the method further comprises: creating a stop word list corresponding to a file type of the resource file, wherein creating the stop word list corresponding to the file type of the resource file comprises:
performing word segmentation on all resource files in a resource library, wherein the resource library comprises resource files of a plurality of types;
screening out stop words corresponding to each type of resource files from word segmentation results of the full resource files, wherein the stop words corresponding to each type of resource files are determined according to the occurrence frequency of each stop word in each type of resource files;
and generating a stop word list corresponding to the file type according to the stop words corresponding to each type of resource file.
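One possible reading of claim 4 is sketched below: for each file type, a word becomes a stop word when it appears in a large enough share of the files of that type. The document-frequency criterion, the threshold `min_doc_ratio`, and the function name are assumptions; the claim only states that stop words are determined from their occurrence frequency in each type of resource file.

```python
from collections import Counter, defaultdict


def build_stop_lists(library, min_doc_ratio=0.8):
    """library: iterable of (file_type, tokens). Returns {file_type: set of stop words}."""
    doc_counts = defaultdict(Counter)  # file_type -> token -> number of files containing it
    totals = Counter()                 # file_type -> number of files of that type
    for file_type, tokens in library:
        totals[file_type] += 1
        doc_counts[file_type].update(set(tokens))
    return {
        file_type: {tok for tok, n in counts.items()
                    if n / totals[file_type] >= min_doc_ratio}
        for file_type, counts in doc_counts.items()
    }


library = [
    ("courseware", ["slide", "fractions", "exercise"]),
    ("courseware", ["slide", "geometry"]),
    ("video", ["hello", "class", "fractions"]),
    ("video", ["hello", "class", "geometry"]),
]
stop_lists = build_stop_lists(library)
```

Here "slide" is filtered only for courseware and "hello" and "class" only for videos, which is the point of keeping one list per file type.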
5. The method of claim 4, wherein cleaning the word segmentation result using the stop word list comprises: cleaning the word segmentation result using the stop word list corresponding to the file type of the resource file.
6. The method of claim 2, wherein after performing text vectorization processing based on the cleaning result to obtain a text vector for representing the feature information, the method further comprises one or more of:
scaling the text vector by an activation function;
and performing dimension reduction processing on the text vector.
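The two optional post-processing steps of claim 6 might look like the following sketch. The claim names neither the activation function nor the dimension reduction technique, so `tanh` and a simple keep-the-highest-variance-columns selection are assumptions used here in place of whatever a real system would choose (for example, principal component analysis).

```python
import math
from statistics import pvariance


def scale(vector):
    """Squash every component into the open interval (-1, 1) with tanh."""
    return [math.tanh(x) for x in vector]


def reduce_dims(vectors, k):
    """Keep the k columns with the largest variance across all vectors."""
    n_dims = len(vectors[0])
    variances = [pvariance([v[d] for v in vectors]) for d in range(n_dims)]
    top = sorted(range(n_dims), key=lambda d: variances[d], reverse=True)[:k]
    keep = sorted(top)  # preserve the original column order
    return [[v[d] for d in keep] for v in vectors]


raw = [[0.0, 3.0, 1.0], [0.0, -3.0, 1.5], [0.0, 0.5, 1.0]]
scaled = [scale(v) for v in raw]
reduced = reduce_dims(scaled, k=2)
```

The first column is constant across files, carries no information for clustering, and is the one dropped.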
7. The method of claim 1, wherein clustering the plurality of resource files based on the characteristic information of each of the resource files generates a plurality of resource clusters, comprising:
clustering the plurality of resource files based on the characteristic information of each resource file by a K-means clustering algorithm to generate the plurality of resource clusters.
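For reference, a minimal K-means loop of the kind named in claim 7 is sketched below in plain Python. The seeded random initialization, the fixed iteration count, and the Euclidean distance are assumptions; a production system would more likely rely on an existing library implementation.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Return (labels, centroids) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
```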
8. The method according to claim 1, wherein, in the case where the neighbor files all belong to the same first target resource cluster, assigning the newly added resource file to one of the plurality of resource clusters or generating a new resource cluster for the newly added resource file according to the distance relationship between the newly added resource file and the neighbor files comprises:
acquiring a first distance between the newly added resource file and the centroid of the first target resource cluster;
acquiring a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid;
acquiring the average distance between all resource files in the first target resource cluster and the centroid;
assigning the newly added resource file to the first target resource cluster in the case where the difference between the first distance and the second distance is smaller than or equal to the average distance;
and generating a new resource cluster for the newly added resource file in the case where the difference between the first distance and the second distance is larger than the average distance.
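The decision rule of claim 8 reduces to one comparison, shown in the sketch below. The Euclidean distance, the computation of the centroid as the mean of the cluster members, and the function and return-value names are assumptions made for illustration.

```python
import math


def place_when_neighbors_agree(new_vec, cluster_vecs):
    """Return "join" or "new_cluster" for a file whose neighbors share one cluster."""
    centroid = [sum(col) / len(cluster_vecs) for col in zip(*cluster_vecs)]
    member_dists = [math.dist(v, centroid) for v in cluster_vecs]
    first_distance = math.dist(new_vec, centroid)     # new file to centroid
    second_distance = max(member_dists)               # farthest member to centroid
    average_distance = sum(member_dists) / len(member_dists)
    if first_distance - second_distance <= average_distance:
        return "join"
    return "new_cluster"


cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
near = place_when_neighbors_agree([1.5, 1.5], cluster)
far = place_when_neighbors_agree([10.0, 10.0], cluster)
```

In effect the new file joins the cluster when it lies no more than one average radius beyond the current cluster boundary.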
9. The method of claim 1, wherein each resource file has a corresponding file grade and each resource cluster has a corresponding topic, and wherein extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request and returning the file package comprises:
acquiring request information in the resource file extraction request, wherein the request information comprises at least one of the following: an extraction topic, an extraction file grade, and an extraction quantity corresponding to each file type;
in the case where the resource file extraction request includes the extraction topic and the extraction file grade, screening out resource files that conform to the extraction file grade from the resource cluster whose topic is the same as the extraction topic;
and in the case where the resource file extraction request includes the extraction quantity, randomly extracting resource files corresponding to the extraction quantity from the resource files that conform to the extraction file grade to form the file package, and returning the file package.
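The extraction of claim 9 can be sketched as a filter followed by random sampling per file type. The `Resource` record, its field names, and the representation of the extraction quantity as a mapping from file type to count are hypothetical; the sketch also caps each count at the number of available files, which the claim does not address.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    file_type: str
    grade: str
    topic: str  # topic of the resource cluster the file belongs to


def build_package(resources, topic, grade, quantities, seed=None):
    """quantities: {file_type: count}. Returns the list of selected resources."""
    rng = random.Random(seed)
    # Screen by cluster topic and file grade first.
    candidates = [r for r in resources if r.topic == topic and r.grade == grade]
    package = []
    for file_type, count in quantities.items():
        pool = [r for r in candidates if r.file_type == file_type]
        package.extend(rng.sample(pool, min(count, len(pool))))
    return package


resources = [
    Resource("frac-slides", "courseware", "grade5", "fractions"),
    Resource("frac-video", "video", "grade5", "fractions"),
    Resource("frac-quiz-1", "exercise", "grade5", "fractions"),
    Resource("frac-quiz-2", "exercise", "grade5", "fractions"),
    Resource("frac-quiz-3", "exercise", "grade6", "fractions"),
    Resource("geo-video", "video", "grade5", "geometry"),
]
package = build_package(resources, "fractions", "grade5",
                        {"video": 1, "exercise": 2}, seed=1)
```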
10. The method of claim 9, further comprising:
in the case where the extraction quantity is not included in the resource file extraction request, determining the extraction quantity according to the historical extraction behavior of the requesting party, or extracting resource files according to a preset extraction quantity;
in the case where the extraction topic is not included in the resource file extraction request, extracting resource files from the first N resource clusters when the resource clusters are sorted by number of resource files from high to low;
and in the case where the extraction file grade is not included in the resource file extraction request, randomly extracting resource files corresponding to the extraction quantity from the resource cluster whose topic is the same as the extraction topic to form the file package.
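The fallbacks of claim 10 can be read as filling in defaults for missing request fields, as in the sketch below. The dictionary-shaped request, the use of the mean of past extraction quantities as the "historical extraction behavior", and the parameter names `default_quantity` and `top_n` are assumptions; a grade of `None` is taken to mean that no grade filter is applied.

```python
from collections import Counter


def resolve_request(request, cluster_sizes, history, default_quantity=3, top_n=2):
    """Return (topics, grade, quantity) with defaults filled in for missing fields.
    cluster_sizes: {topic: number of files}; history: past quantities of the requester."""
    quantity = request.get("quantity")
    if quantity is None:
        # Use past behavior when available, otherwise the preset quantity.
        quantity = round(sum(history) / len(history)) if history else default_quantity
    topic = request.get("topic")
    if topic is None:
        # No topic given: take the N clusters holding the most resource files.
        topics = [t for t, _ in Counter(cluster_sizes).most_common(top_n)]
    else:
        topics = [topic]
    return topics, request.get("grade"), quantity


sizes = {"fractions": 10, "geometry": 4, "algebra": 7}
empty_request = resolve_request({}, sizes, history=[2, 4])
full_request = resolve_request(
    {"topic": "geometry", "grade": "grade5", "quantity": 1}, sizes, history=[])
```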
11. A file processing apparatus, comprising:
the acquisition module is used for acquiring a plurality of resource files and constructing characteristic information of each resource file;
the clustering module is used for clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters; after the plurality of resource clusters are generated, receiving a newly added resource file and constructing characteristic information of the newly added resource file; determining neighbor files of the newly added resource file according to the characteristic information of the newly added resource file and the characteristic information of the existing resource files; and, according to the distance relationship between the newly added resource file and the neighbor files, assigning the newly added resource file to one of the plurality of resource clusters or generating a new resource cluster for the newly added resource file;
wherein, in the case where the neighbor files do not all belong to the same resource cluster, assigning the newly added resource file to one of the plurality of resource clusters or generating a new resource cluster for the newly added resource file according to the distance relationship between the newly added resource file and the neighbor files comprises:
acquiring, for each resource cluster to which the neighbor files belong, the proportion of the neighbor files that belong to that resource cluster;
when there exist a second target resource cluster and a third target resource cluster whose proportions differ by less than a first preset value, acquiring a first average value of the distances between the newly added resource file and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource file and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files within the second target resource cluster, and a fourth average value of the distances between the neighbor files within the third target resource cluster;
if the first average value, the second average value, the third average value and the fourth average value meet a preset condition, adding the newly added resource file and the neighbor files in the second target resource cluster to the third target resource cluster, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset condition comprises: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value;
if the first average value, the second average value, the third average value and the fourth average value do not meet the preset condition, acquiring the resource cluster to which the centroid closest to the newly added resource file belongs and adding the newly added resource file to that resource cluster, or generating a new resource cluster for the newly added resource file;
and the composition module is used for extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
12. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any of claims 1 to 10.
13. An intelligent interactive tablet, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 10.
CN202011602808.9A 2020-12-29 2020-12-29 File processing method and device Active CN112732867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011602808.9A CN112732867B (en) 2020-12-29 2020-12-29 File processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011602808.9A CN112732867B (en) 2020-12-29 2020-12-29 File processing method and device

Publications (2)

Publication Number Publication Date
CN112732867A CN112732867A (en) 2021-04-30
CN112732867B CN112732867B (en) 2024-03-15

Family

ID=75610513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011602808.9A Active CN112732867B (en) 2020-12-29 2020-12-29 File processing method and device

Country Status (1)

Country Link
CN (1) CN112732867B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system
CN108647244A (en) * 2018-04-13 2018-10-12 广东技术师范学院 The tutorial resources integration method of mind map form, network store system
CN109299315A (en) * 2018-09-03 2019-02-01 腾讯科技(深圳)有限公司 Multimedia resource classification method, device, computer equipment and storage medium
CN110929161A (en) * 2019-12-02 2020-03-27 南京莱斯网信技术研究院有限公司 Large-scale user-oriented personalized teaching resource recommendation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于模糊聚类的教学资源自适应推荐研究;黎孟雄 等;中国远程教育(第7期);89-92 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant