CN114186557A - Method, device and storage medium for determining subject term - Google Patents


Info

Publication number
CN114186557A
Authority
CN
China
Prior art keywords
phrase
text
degree
phrases
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210143658.2A
Other languages
Chinese (zh)
Inventor
邓憧
王雯
索宏彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210143658.2A
Publication of CN114186557A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method, a device and a storage medium for determining a subject term, and relates to the technical field of data processing, in particular to text processing. The method comprises: acquiring a plurality of phrases in a text to be processed, wherein each phrase comprises at least one participle; calculating the degree of aggregation of each phrase and the degree of freedom among the phrases, wherein the degree of aggregation describes the probability that the participles of a phrase occur together, and the degree of freedom characterizes how fixed the collocation between a phrase and its adjacent phrases is; and determining the subject term of the text to be processed according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases. Subject terms are thereby extracted from the text automatically, and the integrity and accuracy of the extracted subject terms are improved.

Description

Method, device and storage medium for determining subject term
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, a device, and a storage medium for determining a subject term.
Background
With the rapid development of the internet and the maturing of online collaboration software, online audio and video processing is favored by more and more users for its convenience.
Online audio and video processing scenarios include online audio conferences, online video conferences, online education, processing of audio and video works, and the like. Taking an online conference as an example, conference software generally can store the conference audio and convert it into text data, so that after the conference a user can summarize and review it from the audio or the text, for example by writing a conference summary, and thus better understand the conference content. However, online conferences tend to be long, so the converted text is lengthy; the user has to spend a long time distilling the conference content, and the user experience is poor.
Disclosure of Invention
The application provides a method, equipment and storage medium for determining subject terms, which realize automatic extraction of text subject terms and improve the efficiency of text content extraction.
In a first aspect, the present application provides a method for determining a subject term, including:
acquiring a plurality of phrases in a text to be processed, wherein each phrase comprises at least one participle; calculating a degree of aggregation of each phrase and a degree of freedom among the phrases, wherein the degree of aggregation describes the probability that the participles of a phrase occur together, and the degree of freedom characterizes how fixed the collocation between a phrase and its adjacent phrases is; and determining the subject term of the text to be processed according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases.
In a second aspect, the present application provides another method for determining a subject term, including:
acquiring audio data of a conference collected by a conference system, and generating a text to be processed according to the audio data; acquiring a plurality of phrases in the text to be processed, wherein each phrase comprises at least one participle; calculating a degree of aggregation of each phrase and a degree of freedom among the phrases, wherein the degree of aggregation describes the probability that the participles of a phrase occur together, and the degree of freedom characterizes how fixed the collocation between a phrase and its adjacent phrases is; and determining the subject terms of the conference according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases.
In a third aspect, the present application provides a subject term determination apparatus, comprising:
a phrase acquisition module, configured to acquire a plurality of phrases in the text to be processed, wherein each phrase comprises at least one participle; a phrase parameter calculation module, configured to calculate the degree of aggregation of each phrase and the degree of freedom among the phrases, wherein the degree of aggregation describes the probability that the participles of a phrase occur together, and the degree of freedom characterizes how fixed the collocation between a phrase and its adjacent phrases is; and a subject term determination module, configured to determine the subject terms of the text to be processed according to the degree of aggregation of each phrase and the degree of freedom among the phrases.
In a fourth aspect, the present application provides a topic word determination apparatus comprising:
a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the subject term determination method provided in the first aspect or the second aspect of the present application.
In a fifth aspect, the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the computer-readable storage medium is used for implementing the subject term determination method provided in the first aspect or the second aspect of the present application.
In a sixth aspect, the present application provides a computer program product comprising a computer program that, when executed by a processor, implements the subject term determination method provided in the first or second aspect of the present application.
According to the subject term determination method, device, storage medium and program product provided by the application, the text to be processed is divided into phrases, and the phrases are processed based on their degree of aggregation and degree of freedom to obtain the subject terms of the text. Subject terms are thus extracted quickly, which makes it easier to distill the content of the text. Because extraction is performed in units of phrases, the extracted subject terms are more complete; a user can quickly determine the main content of the text to be processed from the extracted subject terms, and content extraction becomes more efficient.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a method for determining a topic word according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of step S203 in the embodiment shown in Fig. 2;
Fig. 4 is a schematic flowchart of a method for determining a topic word according to another embodiment of the present application;
Fig. 5 is a schematic diagram of a weighted directed graph in the embodiment shown in Fig. 4;
Fig. 6 is a schematic diagram of determining topic words from candidate phrases in the embodiment shown in Fig. 4;
Fig. 7 is a schematic flowchart of a method for determining a topic word according to another embodiment of the present application;
Fig. 8 is a schematic structural diagram of a topic word determination apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a topic word determination apparatus according to an embodiment of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to explain the inventive concept to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The method and the device can be used for extracting the subject terms of the text, and particularly can be used for extracting the subject terms of the spoken text. Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application, as shown in fig. 1, in an online conference, one or more participating users may participate in the online conference through conference software installed on user terminals 102, and each user terminal 102 communicates with a server 104 through a network, so as to implement communication of conference audio data of the online conference. In fig. 1, 3 user terminals 102 are taken as an example, and a plurality of participating users may share one user terminal 102.
In some techniques, the participating users may store conference audio data in the corresponding user terminals 102 through the associated functionality of the conference software to facilitate review and summarization of conference content after the conference is completed.
In some technologies, the conference software or the server 104 also provides a function of converting conference audio data into text. Alternatively, the user may convert the conference audio data into text data through other audio recognition tools. The user can then quickly review the conference or distill its main content based on the text data.
In other online audio-video scenes, such as voice memos, audio-visual works, online education, etc., often only the function of converting audio into text data such as subtitles by audio recognition is provided.
In the above audio and video scenarios, long audio and video durations make the converted text data too long, so a user cannot quickly grasp the main content of the audio or video from the converted text; that is, the user needs to spend a long time reading and distilling the text, and the user experience is poor.
Some technologies provide subject term extraction methods for text data. However, the models they rely on are written-language models, so accuracy is poor when extracting subject terms from spoken text converted from audio. Moreover, most existing subject term extraction methods rely only on attributes of individual participles, such as position and word frequency, which easily leads to incomplete extracted subject terms.
The subject term determination method provided by the application aims to solve the technical problems in the prior art. The main concept of the subject term determination method is as follows: the method comprises the steps of dividing a text to be processed, such as a spoken text, into a plurality of phrases, wherein the phrases can be composed of one or more participles, and determining subject words of the text to be processed based on phrase attributes of the plurality of phrases, including aggregation and freedom, so that automatic extraction of the subject words is realized, and the completeness of the subject word extraction is ensured.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a method for determining a topic word according to an embodiment of the present application. The method provided by this embodiment may be applied to extract the subject term of the text of the audio/video data, such as the subject term extraction of the text of the voice data conversion in the online meeting scene, the offline meeting scene, the court trial scene, the online education, the movie and television works, and other scenes shown in fig. 1, and the method may be executed by any device having a data processing function, such as the server 104, the user terminal 102, or a subject term extraction device in the subsequent embodiments in fig. 1.
As shown in fig. 2, the subject term determination method includes:
step S201, a plurality of phrases in the text to be processed are acquired.
One phrase may include one or more participles.
In one embodiment, the text to be processed may be text data of audio-video data conversion, such as spoken text.
For example, the text to be processed may be text data converted from audio data output by a teacher during teaching in an online education scene, text data converted from audio data of a conference recorded in an online or offline conference scene, text data converted from a voice memo, or converted text data from audio and video works, such as talk show, movie, interview, and the like.
In one embodiment, the text to be processed may be written text of an article, news, instructions, and the like.
In one embodiment, after the text to be processed is obtained, word segmentation may be performed on it, and the phrases of the text are obtained by combining the resulting participles.
In one embodiment, a word segmenter may be used to segment the text to be processed into clauses and participles, and operations such as part-of-speech tagging and normalization may be performed on each participle.
In one embodiment, for each clause, the participles of the clause are combined according to parameters such as the occurrence frequency of each participle in the clause and the occurrence frequency of the adjacent words of each participle, so as to obtain one or more phrases corresponding to the clause.
In one embodiment, the language detection may be performed on the text to be processed to obtain the language adopted by the text to be processed, and the word segmentation device matched with the language is selected to perform word segmentation processing on the text to be processed.
In one embodiment, the word segmentation device may perform word segmentation on the text to be processed based on a pre-stored dictionary to obtain each word segmentation of the text to be processed.
In one embodiment, a domain dictionary for the customer may be generated from the corpus and the training corpus uploaded by the user.
In one embodiment, the phrases of the text to be processed may be obtained based on an n-gram model, such as a bigram model or a trigram model. Word segmentation and part-of-speech tagging may also be performed on the text to be processed based on the n-gram model.
Illustratively, taking the clause "this is a safe and reliable scheme" in the text to be processed as an example, the clause may be split by a word segmenter or an n-gram model into the five participles "this is", "one", "safe", "reliable" and "scheme", and, through full-text analysis of the text to be processed, into the two phrases "this is a" and "safe and reliable scheme".
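The source does not say how the n-gram model turns participles into candidate phrases. As a minimal sketch, assuming candidates are simply all runs of up to `max_n` consecutive participles within a clause, their occurrence counts could be collected as follows. The function name `ngram_counts` and the parameter `max_n` are hypothetical and are not taken from the application.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(clauses: Sequence[List[str]], max_n: int = 3) -> Counter:
    """Count every run of 1..max_n consecutive participles in each clause.

    Assumption: candidate phrases never cross clause boundaries.
    """
    counts: Counter = Counter()
    for clause in clauses:
        for n in range(1, max_n + 1):
            for start in range(len(clause) - n + 1):
                gram: Tuple[str, ...] = tuple(clause[start:start + n])
                counts[gram] += 1
    return counts


# Participles of the example clause, plus a second clause repeating its tail.
clauses = [
    ["this is", "one", "safe", "reliable", "scheme"],
    ["safe", "reliable", "scheme"],
]
counts = ngram_counts(clauses)
```

The resulting counts are the raw material for the frequency-based quantities discussed later in the description.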
In one embodiment, after obtaining each participle of the text to be processed, the participle of the text to be processed may be further filtered to delete a part of the participle.
In one embodiment, the participles of the text to be processed may be filtered based on a stop word list and a filter word list, to delete the stop words and filter words among them. Stop words are words in the stop word list, and filter words are words in the filter word list. The filter word list may be generated based on the n-gram model and a corpus.
In one embodiment, the participles of the text to be processed may also be filtered based on part of speech and word length. For example, participles with parts of speech such as modal particles and interjections may be deleted, and participles that are too short or too long may also be deleted.
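The filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the part-of-speech tag names in `DROPPED_POS`, the length limits `min_len` and `max_len`, and the function name `filter_participles` are all hypothetical; a real tagger's tag set and suitable limits will differ.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical tag names for modal particles and interjections.
DROPPED_POS: Set[str] = {"modal_particle", "interjection"}


def filter_participles(
    tagged: Iterable[Tuple[str, str]],
    stop_words: Set[str],
    filter_words: Set[str],
    min_len: int = 2,
    max_len: int = 8,
) -> List[Tuple[str, str]]:
    """Keep (word, pos) pairs that pass the word-list, POS and length checks."""
    kept = []
    for word, pos in tagged:
        if word in stop_words or word in filter_words:
            continue  # listed in the stop word list or the filter word list
        if pos in DROPPED_POS:
            continue  # part of speech that is filtered out
        if not (min_len <= len(word) <= max_len):
            continue  # too short or too long
        kept.append((word, pos))
    return kept
```

For example, with `stop_words={"the"}` and `filter_words={"scheme"}`, the input `[("um", "interjection"), ("the", "det"), ("safe", "adj"), ("scheme", "noun")]` keeps only `("safe", "adj")`.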
Step S202, aiming at the phrases, calculating the degree of cohesion of each phrase and the degree of freedom among the phrases.
The degree of aggregation describes the probability that the participles of a phrase occur together, and the degree of freedom characterizes how fixed the collocation between a phrase and its adjacent phrases is. The adjacent phrases are the participles or other phrases adjacent to the phrase in the text to be processed, and may include left adjacent phrases and right adjacent phrases. The degree of aggregation, also called the degree of cohesion or degree of coagulation, describes the degree of association between the participles or morphemes included in a phrase, and is an internal attribute of the phrase. The degree of freedom describes how fixed the adjacent phrases of a phrase are, and is an external attribute of the phrase: the less fixed the adjacent phrases of a phrase are, the higher its degree of freedom. The degree of freedom of a phrase may be understood as the degree of freedom of that phrase relative to one or more other phrases, or as the degree of freedom between that phrase and one or more other phrases.
Specifically, for each phrase in the text to be processed, the degree of aggregation of the phrase may be calculated, and the degree of freedom between the phrase and other phrases, referred to simply as the degree of freedom of the phrase, may be calculated. In one embodiment, calculating the degrees of freedom among the plurality of phrases may equivalently be described as calculating the degree of freedom of each phrase, or the degree of freedom of each phrase relative to other phrases.
In one embodiment, the more often, or the more probably, the participles making up a phrase occur together, the higher the degree of aggregation of the phrase. The more kinds of adjacent phrases a phrase has, the higher its degree of freedom.
In one embodiment, the degree of aggregation and the degree of freedom of each phrase may be calculated based on the frequency of occurrence of each participle in the text to be processed.
Step S203, determining the subject term of the text to be processed according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases.
Wherein a subject word may be composed of one or more phrases.
In one embodiment, the phrases of the text to be processed may be filtered, combined, and the like based on the aggregation degree and the degree of freedom, so as to obtain the subject terms of the text to be processed.
In one embodiment, after the subject term of the text to be processed is obtained, compliance check may be performed on the subject term based on a blacklist, a sensitive word list, and the like, so as to delete each subject term that is not compliant, and output a final result.
The topic word extraction method provided by this embodiment performs phrase division on a to-be-processed text to obtain each phrase in the to-be-processed text, processes the phrases of the to-be-processed text based on the aggregation degree and the degree of freedom of the phrase to obtain the topic words of the to-be-processed text, realizes rapid extraction of the topic words of the text, provides convenience for refining the content of the to-be-processed text, performs topic word extraction by using the phrase as a unit, and improves the integrity of the topic word extraction.
Optionally, calculating the degree of aggregation of each phrase and the degrees of freedom between the plurality of phrases comprises:
for an Nth phrase, calculating the degree of cohesion of the Nth phrase according to the frequency of occurrence of the Nth phrase in the text to be processed and the frequency of occurrence of each participle forming the Nth phrase in the text to be processed; calculating the degree of freedom of the Nth phrase according to the occurrence frequency of the Nth phrase in the text to be processed and the occurrence frequency of each spliced phrase of the Nth phrase in the text to be processed, wherein the spliced phrase of the Nth phrase comprises: the Nth phrase and the phrases adjacent to the Nth phrase, N is a positive integer and is less than or equal to the number of the phrases in the text to be processed.
In one embodiment, the degree of aggregation of the nth phrase may be determined by a ratio of the frequency of occurrence of the nth phrase in the text to be processed to the sum of the frequency of occurrence of the individual participles that make up the nth phrase in the text to be processed.
In one embodiment, the degree of aggregation of the nth phrase may be calculated according to the occurrence probability of the nth phrase in the text to be processed and the occurrence probability of each participle forming the nth phrase in the text to be processed, and the occurrence probability of the phrase or the participle may be a ratio of the occurrence frequency of the phrase or the participle to the sum of the occurrence frequencies of each participle of the text to be processed.
Illustratively, the degree of aggregation of a phrase may be calculated based on the following expression:
$$C(w) = \frac{P(w)}{\prod_{i=1}^{n} P(w_i)}$$
wherein $C(w)$ is the degree of aggregation of the phrase $w$; $P(\cdot)$ represents a probability of occurrence; the phrase $w$ is composed of $n$ participles; and $w_i$ is the $i$-th participle of the phrase $w$.
In one embodiment, in order to avoid the value of the degree of aggregation being too large, a logarithm operation may be performed on the degree of aggregation, and subsequent operations may be performed based on the logarithm of the degree of aggregation.
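A minimal sketch of the log-scaled degree of aggregation follows. It assumes the ratio form, that is, the probability of the phrase divided by the product of the probabilities of its participles, with each probability taken as a count divided by the total participle count of the text, as described above. The ratio form, the function name `log_aggregation`, and the two `Counter` arguments are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def log_aggregation(
    phrase: Sequence[str],
    phrase_counts: Counter,
    word_counts: Counter,
) -> float:
    """Return log(P(phrase) / prod_i P(participle_i)).

    Probabilities are counts divided by the total number of participle
    occurrences in the text (an assumption that follows the description).
    """
    total = sum(word_counts.values())
    phrase_count = phrase_counts[tuple(phrase)]
    if phrase_count == 0 or any(word_counts[w] == 0 for w in phrase):
        raise ValueError("phrase and all of its participles must occur in the text")
    log_p_phrase = math.log(phrase_count / total)
    log_p_parts = sum(math.log(word_counts[w] / total) for w in phrase)
    # Working in log space turns the product into a sum and keeps values small.
    return log_p_phrase - log_p_parts
```

With participle counts of 2, 2 and 4 for "safe", "reliable" and "scheme" (8 in total) and the phrase ("safe", "reliable") seen twice, the ratio is 0.25 / (0.25 × 0.25) = 4, so the function returns log 4.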
In one embodiment, the stitched phrases may include a left stitched phrase consisting of a phrase and its left neighboring phrase and a right stitched phrase consisting of a phrase and its right neighboring phrase. The degrees of freedom of a phrase (e.g., the nth phrase) also correspondingly include a left degree of freedom and a right degree of freedom, the left degree of freedom being used to describe a fixed degree of the phrase and its left neighboring phrase, which may be determined by the frequency of occurrence of the phrase in the text to be processed and the frequency of occurrence of each left-stitched phrase of the phrase in the text to be processed. The right degree of freedom is used for describing the fixed degree of the phrase and the adjacent phrase at the right side of the phrase, and can be determined by the occurrence frequency of the phrase in the text to be processed and the occurrence frequency of each right-spliced phrase of the phrase in the text to be processed.
It should be understood that the occurrence probability or occurrence frequency referred to in the embodiments of the present application is generally relative to the text to be processed; that is, the occurrence frequency or occurrence probability of a phrase or participle is generally its occurrence frequency or occurrence probability in the text to be processed.
For example, take a clause that reads, word for word, "after sun / timely repair / appears more important". The adjacent phrases or adjacent words of the phrase "timely repair" include "after sun" and "appears" (or "appears more important"): "after sun" is the left adjacent phrase of "timely repair", and "appears" or "appears more important" is its right adjacent phrase.
In one embodiment, the left degree of freedom of the phrase may be determined according to a ratio of the frequency of occurrence of the phrase in the text to be processed to a sum of the frequency of occurrence of each left-stitched phrase of the phrase in the text to be processed. Correspondingly, the right degree of freedom of the phrase can be determined according to the ratio of the occurrence frequency of the phrase in the text to be processed to the sum of the occurrence frequencies of the right splicing phrases of the phrase in the text to be processed.
In one embodiment, the lesser of the left degree of freedom or the right degree of freedom of the phrase may be determined to be the degree of freedom of the phrase.
Optionally, calculating the degree of freedom of the nth phrase according to the frequency of occurrence of the nth phrase in the text to be processed and the frequency of occurrence of each spliced phrase of the nth phrase in the text to be processed, including:
respectively calculating first probabilities of the spliced phrases appearing under the condition that the Nth phrase appears according to the frequency of appearance of the Nth phrase in the text to be processed and the frequency of appearance of each spliced phrase of the Nth phrase in the text to be processed; and determining the degree of freedom of the Nth phrase according to the information entropy of the first probability of each spliced phrase.
The first probability of a spliced phrase of the phrase (e.g., the Nth phrase) is a conditional probability, specifically expressed as:
$$P(a_j \mid w) = \frac{N(a_j, w)}{N(w)}$$
wherein $a_j$ is the $j$-th adjacent phrase of the phrase $w$, $j$ is a positive integer less than or equal to $m$, $m$ is the total number of adjacent phrases of the phrase in the text to be processed, $P(a_j \mid w)$ is the first probability of the spliced phrase consisting of the adjacent phrase $a_j$ and the phrase, $N(a_j, w)$ is the frequency of occurrence of that spliced phrase in the text to be processed, and $N(w)$ is the frequency of occurrence of the phrase in the text to be processed.
Illustratively, suppose the adjacent phrases of the phrase in the text to be processed include $a_1$, $a_2$ and $a_3$, the phrase appears 6 times in the text to be processed, and the spliced phrases formed by the phrase with $a_1$, $a_2$ and $a_3$ appear 2, 3 and 1 times respectively. Then the first probabilities of the corresponding spliced phrases are 2/6 = 1/3, 3/6 = 1/2, and 1/6.
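The arithmetic of this example can be checked with exact fractions. The sketch below assumes the first probability is the count of a spliced phrase divided by the count of the phrase itself; the function name `first_probabilities` and the labels `a1`, `a2`, `a3` for the three adjacent phrases are placeholders introduced here.

```python
from fractions import Fraction
from typing import Dict


def first_probabilities(
    phrase_count: int, spliced_counts: Dict[str, int]
) -> Dict[str, Fraction]:
    """Conditional probability of each spliced phrase given the phrase occurs."""
    return {adj: Fraction(count, phrase_count) for adj, count in spliced_counts.items()}


# The phrase occurs 6 times; its three spliced phrases occur 2, 3 and 1 times.
probs = first_probabilities(6, {"a1": 2, "a2": 3, "a3": 1})
```

Here the three probabilities sum to 1 because every occurrence of the phrase is accounted for by one of the three spliced phrases.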
In one embodiment, the degree of freedom of the phrase is:
$$F(w) = -\sum_{j=1}^{m} P(a_j \mid w)\,\log P(a_j \mid w)$$
wherein $F(w)$ represents the degree of freedom of the phrase $w$, and $P(a_j \mid w)$ is the first probability of the spliced phrase formed with the $j$-th of its $m$ adjacent phrases.
In one embodiment, the left degree of freedom or the right degree of freedom of the phrase may be calculated based on the above expression, and only the calculated object needs to be changed from the stitched phrase of the phrase to the left stitched phrase of the phrase or the right stitched phrase of the phrase.
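As a hedged sketch, the entropy-based left and right degrees of freedom can be computed from a sequence of phrases as shown below. Taking the smaller of the two as the phrase's degree of freedom borrows the "lesser of left or right" rule from an earlier embodiment; combining that rule with the entropy form is an assumption of this example, as are the function names and the natural logarithm.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def neighbor_counts(phrases: Sequence[str], target: str) -> Tuple[Counter, Counter]:
    """Count the left and right adjacent phrases of every occurrence of target."""
    left: Counter = Counter()
    right: Counter = Counter()
    for i, phrase in enumerate(phrases):
        if phrase != target:
            continue
        if i > 0:
            left[phrases[i - 1]] += 1
        if i + 1 < len(phrases):
            right[phrases[i + 1]] += 1
    return left, right


def entropy_freedom(phrase_count: int, spliced_counts: Counter) -> float:
    """Information entropy of the first probabilities count / phrase_count."""
    probs = [c / phrase_count for c in spliced_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)


def degree_of_freedom(phrases: Sequence[str], target: str) -> float:
    """Smaller of the left and right entropy-based degrees of freedom."""
    count = sum(1 for p in phrases if p == target)
    if count == 0:
        raise ValueError("target phrase does not occur")
    left, right = neighbor_counts(phrases, target)
    return min(entropy_freedom(count, left), entropy_freedom(count, right))
```

A phrase that always has the same neighbor on one side gets an entropy of zero on that side, so the minimum flags it as a fragment of a longer fixed expression.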
Optionally, fig. 3 is a schematic flowchart of the step S203 in the embodiment shown in fig. 2 of the present application, and as shown in fig. 3, the step S203 may include the following steps:
step S301, determining at least one candidate phrase from the phrases and the spliced phrases thereof according to the degree of aggregation of each phrase and the degree of freedom among the phrases.
In one embodiment, at least one candidate phrase may be determined from the plurality of phrases based on the degree of aggregation and the degree of freedom.
In one embodiment, lower limit values for the degree of aggregation and the degree of freedom may be set, and each phrase having both a degree of aggregation and a degree of freedom greater than the corresponding lower limit value may be determined as each candidate phrase.
Optionally, determining at least one candidate phrase from the plurality of phrases and their concatenated phrases according to the degree of aggregation of each phrase and the degree of freedom between the plurality of phrases, including:
screening the plurality of phrases according to the degree of aggregation, and deleting the phrases with the degree of aggregation lower than a preset degree of aggregation; and for each phrase after screening, determining the phrase or the spliced phrase of the phrase as a candidate phrase according to the degree of freedom of the phrase.
The preset degree of aggregation is a lower limit value of the degree of aggregation, and may be set manually, or a default value may be adopted, and may also be determined according to the total number of the participles of the text to be processed. The degree of freedom of a phrase is the degree of freedom of the phrase relative to other phrases or the degree of freedom between the phrase and one or more other phrases.
In one embodiment, it may be determined whether the degree of freedom of the filtered phrases is greater than a preset degree of freedom; if yes, determining the phrase as one of candidate phrases; if not, determining one or more spliced phrases of the phrase as candidate phrases.
In one embodiment, when the degree of freedom of a phrase is less than the preset degree of freedom, one or more stitched phrases may be further determined as candidate phrases from the stitched phrases of the phrase based on the first probability of each stitched phrase. For example, the stitched phrase with the highest first probability may be determined as the candidate phrase, or each stitched phrase whose first probability is greater than a preset probability may be determined as a candidate phrase.
In one embodiment, it may be determined whether the degree of freedom of the phrase is greater than a preset degree of freedom; if not, the phrase is combined to obtain a combined phrase, for example by combining the phrase with the adjacent phrase that has the highest frequency of occurrence or the highest first probability among its adjacent phrases, where the first probability of an adjacent phrase is the first probability of the stitched phrase formed by that adjacent phrase and the phrase. Each candidate phrase is then determined, based on the degree of aggregation, from the phrases whose degree of freedom is greater than the preset degree of freedom and from the combined phrases. For example, phrases whose degree of aggregation is greater than the preset degree of aggregation and whose degree of freedom is greater than the preset degree of freedom are determined as candidate phrases, and combined phrases whose degree of aggregation is greater than the preset degree of aggregation are determined as candidate phrases.
Illustratively, take the preset degree of aggregation as 50, the preset degree of freedom as 0.5, and the sum of the numbers of occurrences of the participles of the text to be processed as 1000. The phrase ph1 occurs 10 times in the text to be processed, so the occurrence probability of ph1 is 0.01. ph1 is composed of participles v1 and v2, which occur 10 and 16 times in the text to be processed, so their occurrence probabilities are 0.01 and 0.016; the degree of aggregation of ph1 is therefore 0.01 / (0.01 × 0.016) = 62.5, which is higher than the preset degree of aggregation. The left adjacent phrases of ph1 include pl1 and pl2, and the right adjacent phrases include pl3; the stitched phrases formed by pl1, pl2 and pl3 with ph1 occur 1, 8 and 1 times in the text to be processed, so the first probabilities of pl1, pl2 and pl3 are 0.1, 0.8 and 0.1, respectively. The degree of freedom of ph1 is 0.2969, which is below the preset degree of freedom, so the stitched phrase consisting of pl2 (the adjacent phrase with the highest first probability) and ph1 can be determined as one of the candidate phrases.
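The decision in this example can be sketched in a few lines. The sketch assumes the degree of aggregation is the phrase probability divided by the product of its participle probabilities, which is what the numbers in the example imply; the degree of freedom 0.2969 is taken as given rather than recomputed. Function names and the string keys for stitched phrases are hypothetical.

```python
def degree_of_aggregation(phrase_count, participle_counts, total):
    """P(phrase) / product of P(participle), with probabilities as count / total."""
    p_parts = 1.0
    for count in participle_counts:
        p_parts *= count / total
    return (phrase_count / total) / p_parts


def pick_candidate(phrase, aggregation, freedom, stitched_probs,
                   min_aggregation, min_freedom):
    """Return the phrase, one of its stitched phrases, or None if filtered out."""
    if aggregation < min_aggregation:
        return None  # phrase removed by the aggregation screening
    if freedom > min_freedom:
        return phrase  # phrase is free enough to stand alone
    # Otherwise fall back to the stitched phrase with the highest first probability.
    return max(stitched_probs, key=stitched_probs.get)


aggregation = degree_of_aggregation(10, [10, 16], 1000)
candidate = pick_candidate(
    "ph1", aggregation, 0.2969,
    {"pl1 ph1": 0.1, "pl2 ph1": 0.8, "ph1 pl3": 0.1},
    min_aggregation=50, min_freedom=0.5,
)
print(round(aggregation, 4), candidate)
```

Choosing the single most probable stitched phrase is one of the two options the text mentions; the alternative of keeping every stitched phrase above a preset probability would return a list instead.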
Step S302, determining the subject term of the text to be processed from the at least one candidate phrase according to the characteristics of the candidate phrases.
The characteristics of the candidate phrases comprise the frequency of appearance of the candidate phrases in the text to be processed and/or the frequency of appearance of each participle forming the candidate phrases in the text to be processed.
In one embodiment, whether the candidate phrase is a subject word may be determined according to parameters such as an average value and a maximum value of the frequency of occurrence of the participles constituting the candidate phrase in the text to be processed.
In one embodiment, if the frequency of occurrence of the candidate phrase in the text to be processed is greater than a preset frequency, such as 3, 5 or other values, the candidate phrase is determined to be the subject word.
In one embodiment, the candidate phrase is determined to be a subject word if the frequency of occurrence of the candidate phrase in the text to be processed is greater than a preset frequency and the frequency of occurrence of each participle forming the candidate phrase in the text to be processed satisfies a preset condition, for example, the frequency of occurrence of each such participle is within a preset interval. The upper limit of the preset interval may be 10, 15, 20, etc., and the lower limit of the preset interval may be 2, 3, 4, 5, or other values.
In one embodiment, the preset condition may be one or more of that an average value of the frequency of occurrence of each participle constituting the candidate phrase in the text to be processed is greater than a first numerical value, and that a maximum value of the frequency of occurrence of each participle constituting the candidate phrase in the text to be processed is greater than a second numerical value.
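The frequency-based check described above might look like the following sketch. It covers only the interval form of the preset condition; treating the interval as inclusive, the default values (taken from the sample values in the text) and the function name are assumptions.

```python
def is_subject_word(phrase_freq, participle_freqs,
                    min_phrase_freq=3, interval=(2, 20)):
    """Check the phrase frequency and the per-participle frequency interval.

    The interval bounds are treated as inclusive, which is an assumption;
    the text only says the frequencies are "in a preset interval".
    """
    low, high = interval
    return (phrase_freq > min_phrase_freq
            and all(low <= f <= high for f in participle_freqs))


print(is_subject_word(5, [4, 10]))   # frequent phrase, participles in range
print(is_subject_word(3, [4, 10]))   # phrase frequency not above the threshold
print(is_subject_word(5, [4, 30]))   # one participle is too frequent
```

The average-value and maximum-value conditions mentioned in the text could be substituted for the interval test by replacing the `all(...)` expression.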
In one embodiment, the characteristics of the candidate phrase further include a length of the candidate phrase, and the length of the candidate phrase determined as the subject word may be within a set interval.
In one embodiment, the characteristics of the candidate phrases further include participle scores of the participles in the candidate phrases, and the participle scores can be determined according to one or more of the frequency of occurrence of the participles, the language of the participles, the inverse document frequency of the participles, the word coverage of the participles, the word association between the participles, and the like. The word coverage of the participle can be described by the ratio of the number of the clauses including the participle in the text to be processed to the total number of the clauses in the text to be processed. The word association between the participles is used to describe the degree of association of the participle with other participles in the text to be processed.
In one embodiment, the subject word of the text to be processed can be determined from the candidate phrases of the text to be processed according to one or more of the frequency of occurrence of the candidate phrase and the frequency of occurrence of each participle constituting the candidate phrase, and the participle score of each participle constituting the candidate phrase, so as to improve the accuracy of the subject word determination.
Fig. 4 is a flowchart of a subject term determination method according to another embodiment of the present application. Based on the embodiment shown in fig. 3, this embodiment adds a step of performing word segmentation on the text to be processed before step S201 and a step of calculating participle scores before step S302. As shown in fig. 4, the subject term determination method according to this embodiment may include the following steps:
Step S401, performing word segmentation processing on the text to be processed based on a word segmenter to obtain a plurality of clauses, a plurality of participles and the part of speech of each participle.
In an embodiment, the language corresponding to the text to be processed may be detected based on the preprocessing module, the word segmenter corresponding to the language corresponding to the text to be processed is called, and the text to be processed is subjected to clause, word segmentation, part-of-speech tagging and word normalization processing, so as to obtain each clause, each word segmentation and part-of-speech of each word of the text to be processed.
Step S402, drawing a weighted directed graph of the text to be processed.
Each node of the weighted directed graph corresponds to a participle of the text to be processed, and the value or weight of each edge is used for representing the occurrence frequency of a phrase formed by at least two participles connected by the edge.
In one embodiment, the weighted directed graph is also referred to as a participle directed graph. The weighted directed graph of the text to be processed is drawn based on the combination relationships among the participles of the text to be processed and the frequency of occurrence of the combined participles.
In one embodiment, the weighted directed graph may be created based on the participles in each window by traversing the to-be-processed text through a window with a window width k, and in the weighted directed graph, the weight or value of each edge may be the frequency of occurrence of two participles connected by the edge in the same window.
Exemplarily, fig. 5 is a schematic diagram of the weighted directed graph in the embodiment shown in fig. 4 of the present application, as shown in fig. 5, taking a partial clause of a text to be processed as an example, where the partial clause includes 10 participles, where the 10 participles correspond to nodes 1 to 10 one-to-one, an arrow between two nodes represents a combination relationship of the participles corresponding to the two nodes, and a number marked on the arrow represents an occurrence frequency of two participles connected by the arrow or a frequency of occurrence in the same window.
In one embodiment, the word frequency or the occurrence frequency in the text to be processed of each participle can be determined based on the weighted directed graph.
Step S403, determining word relevance among all the participles based on the weighted directed graph.
The word association between the participles may also be referred to as word association of the participles, and is used to describe the association degree between one participle and other participles.
In one embodiment, for each participle, the word relevance of the participle may be determined based on the value or weight of the edge in the weighted directed graph connected to the participle. The edges connected with the participle comprise edges pointing to the nodes corresponding to the participle and edges pointed out by the nodes corresponding to the participle.
For example, in the weighted directed graph shown in fig. 5, the word relevance of the participle corresponding to node 2 is 6 (1+2+2+1), and the word relevance of the participle corresponding to node 5 is 6 (3+2+1).
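A minimal sketch of steps S402 and S403 follows. It assumes that an edge runs from an earlier participle to a later one whenever the two fall within the same window of width k, and that word relevance is the sum of the weights of all incoming and outgoing edges of a node. The exact window-counting rule is not specified in the text, so the counting used here, the function names and the sample tokens are all assumptions.

```python
from collections import Counter


def build_weighted_digraph(tokens, k):
    """Count directed co-occurrences of participles within a window of width k.

    An edge (src, dst) is incremented once for every position where dst
    follows src at a distance smaller than k.
    """
    edges = Counter()
    for i, src in enumerate(tokens):
        for dst in tokens[i + 1 : i + k]:
            edges[(src, dst)] += 1
    return edges


def word_relevance(edges):
    """Sum the weights of the edges pointing to and from each participle."""
    relevance = Counter()
    for (src, dst), weight in edges.items():
        relevance[src] += weight
        relevance[dst] += weight
    return relevance


edges = build_weighted_digraph(["a", "b", "c", "a", "b"], k=2)
relevance = word_relevance(edges)
print(dict(edges))
print(dict(relevance))
```

With k=2 only directly adjacent participles are linked, so the edge weights reduce to bigram counts; a larger k links participles that are further apart.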
Step S404, aiming at each participle, calculating the participle score of the participle according to one or more items of the frequency of occurrence, the word coverage, the inverse document frequency and the language of the participle and the word relevance among the participles.
Wherein the inverse document frequency of the participles may be determined based on the number of documents including the participles in the corpus and the total number of documents of the corpus.
In one embodiment, the frequency of occurrence of each participle in the text to be processed and the language of each participle, such as english, chinese, etc., may be counted, word relevance, frequency of occurrence, and word coverage of each participle may be calculated, and then the parameters of the participle may be digitized and normalized, and then the participle score of the participle may be calculated based on the one or more parameters.
In one embodiment, the participle score for a participle may be determined based on a weighted average of the frequency of occurrence of the participle, word coverage, word relevance, inverse document frequency, and language.
Evaluating the participle score over multi-dimensional parameters of the participle improves the accuracy of the participle score and lays a foundation for the subsequent determination of the subject word.
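The word coverage and the weighted-average score described above can be sketched as follows. The feature names, the weights and the assumption that all features have already been digitized and normalized to comparable ranges are illustrative only; the text does not specify them.

```python
def word_coverage(participle, clauses):
    """Ratio of clauses containing the participle to the total number of clauses."""
    return sum(1 for clause in clauses if participle in clause) / len(clauses)


def participle_score(features, weights):
    """Weighted average of already-normalized feature values.

    Both arguments are dicts keyed by hypothetical feature names.
    """
    total_weight = sum(weights[name] for name in features)
    return sum(weights[name] * value for name, value in features.items()) / total_weight


clauses = [["speech", "model"], ["model"], ["speech"]]
coverage = word_coverage("speech", clauses)
score = participle_score(
    {"frequency": 0.5, "coverage": 1.0},
    {"frequency": 1, "coverage": 3},
)
print(coverage, score)
```

Further features such as word relevance, inverse document frequency or a language indicator could be added to the two dictionaries without changing the function.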
Step S405, a plurality of phrases in the text to be processed are obtained, wherein each phrase comprises at least one word segmentation.
Step S406, calculating the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases for the plurality of phrases.
Step S407, determining at least one candidate phrase from the plurality of phrases and their concatenated phrases according to the degree of aggregation of each phrase and the degree of freedom between the plurality of phrases.
Step S408, determining the subject word of the text to be processed from at least one candidate phrase according to the word segmentation score of each word segmentation forming the candidate phrase and the occurrence frequency of the candidate phrase in the text to be processed.
In one embodiment, for each candidate phrase, it may be determined whether the candidate phrase is a subject word of the text to be processed according to an average value of the segmentation scores of the segmentation words constituting the candidate phrase, a maximum value of the segmentation scores of the segmentation words constituting the candidate phrase, word parameters such as an occurrence frequency of the segmentation words constituting the candidate phrase, and phrase parameters such as an occurrence frequency of the candidate phrase in the text to be processed and a length of the candidate phrase.
Optionally, determining a subject term of the text to be processed from the at least one candidate phrase according to the characteristics of the candidate phrases, including:
calculating a phrase score of each candidate phrase according to the occurrence frequency of each participle forming the candidate phrase, the participle score of each participle and the phrase length of the candidate phrase; and determining the subject word of the text to be processed from each candidate phrase according to the phrase score of each candidate phrase.
In one embodiment, for each candidate phrase, a phrase score for the candidate phrase may be determined based on a weighted average of the frequency of occurrence of the respective participles that make up the candidate phrase, the participle score of the respective participle, and the phrase length of the candidate phrase.
In one embodiment, when the phrase length of the candidate phrase is greater than the first length, the phrase score decreases as the phrase length increases. When the phrase length of the candidate phrase is less than the second length, the phrase score increases as the phrase length increases. When the phrase length of the candidate phrase is between the first length and the second length, the relationship between the phrase length and the phrase score may be a positive correlation, a negative correlation, a non-monotonic relationship, or the like, and may be set manually. The higher the participle scores of the participles constituting a candidate phrase, the higher the phrase score of the candidate phrase.
In one embodiment, when the frequency of occurrence of a participle is less than or equal to a preset frequency, the participle score is positively correlated with the frequency of occurrence, and when the frequency of occurrence of the participle is greater than the preset frequency, the participle score is negatively correlated with the frequency of occurrence.
In one embodiment, the subject term of the text to be processed can be determined from each candidate phrase according to the phrase score of the candidate phrase and the similarity of the candidate phrase, so as to improve the diversity of the subject term.
In one embodiment, the candidate phrases may be grouped based on their similarity to divide the more similar candidate phrases into a group. And selecting the candidate phrase with the highest phrase score from each group as the subject word of the text to be processed.
In one embodiment, the subject term of the text to be processed may be determined from each candidate phrase according to the phrase score of the candidate phrase and the preset number of times, so that the number of occurrences of the participles in the set of the subject terms of the text to be processed is smaller than the preset number of times, thereby avoiding the number of occurrences of the participles in the subject term from being too high, and improving the diversity of the subject terms.
Illustratively, the preset number of times may be 2, 3, 4, or other times.
In one embodiment, the candidate phrases may be sorted based on the phrase scores, or subject term judgment may be performed on each candidate phrase in sequence according to the order from high to low of the phrase scores, so as to ensure that the occurrence frequency of each participle in the finally determined set of subject terms of the text to be processed is less than the preset frequency.
Optionally, determining the subject term of the text to be processed from each candidate phrase according to the phrase score of each candidate phrase, including:
and determining a topic word set of the text to be processed from each candidate phrase according to the similarity, preset times and the phrase score of each candidate phrase, wherein the topic word set comprises at least one topic word, the occurrence times of each participle forming the topic word set in the topic word set are less than the preset times, and the similarity between the candidate phrases is determined by the edit distance and the vector distance between the candidate phrases.
In one embodiment, candidate phrases may be first filtered based on phrase scores to remove candidate phrases with lower phrase scores. And determining a subject word set of the text to be processed from each candidate phrase after screening based on parameters such as preset times, similarity, each participle forming the phrase and the like so as to improve the diversity of the subject words.
In one embodiment, after the phrase score of each candidate phrase is determined, the candidate phrases may be ranked by phrase score, and subject word judgment is performed on each candidate phrase in turn based on the ranking result, that is, it is judged whether the candidate phrase is one of the subject words; if so, the candidate phrase is put into the subject word set to update the set. Whenever a candidate phrase is put into the subject word set, the number of occurrences of each participle in the subject word set needs to be updated. First, the candidate phrase with the highest phrase score is determined as a subject word and put into the subject word set. When subject word judgment is performed on a subsequent candidate phrase, it is judged whether the sum of the number of occurrences of a first participle in the subject word set and the number of occurrences (usually 1) of the first participle in the candidate phrase is greater than the preset number of times; if so, the candidate phrase is determined not to be a subject word of the text to be processed. A first participle is a participle that exists in both the candidate phrase and the subject word set. In addition, if the similarity between the candidate phrase and any subject word in the current subject word set is greater than a preset similarity, the candidate phrase is determined not to be a subject word of the text to be processed.
Illustratively, the candidate phrases of the text to be processed are, in order from high to low according to the phrase score: ph5, ph3, ph8, ph10 and ph15, wherein the current topic word set includes ph5 and ph3, when the topic word determination is performed on ph8, the similarity between ph8 and ph5 and ph3 and the occurrence frequency of the participles in the set corresponding to ph5, ph3 and ph8 need to be calculated, and whether ph8 is determined as the topic word is determined based on the similarity and the occurrence frequency of the participles.
Exemplarily, fig. 6 is a schematic diagram illustrating the subject word judgment of candidate phrases in the embodiment shown in fig. 4 of the present disclosure. As shown in fig. 6, the candidate phrases of the text to be processed whose phrase scores are higher than 60 are, from high to low: "speech recognition technology", "long speech recognition", "spoken language recognition" and "neural network model". Suppose the current subject word set includes "speech recognition technology". To improve the diversity of the subject words, "long speech recognition" is not determined as a subject word because its similarity to "speech recognition technology" is high; and because "spoken language recognition" and "speech recognition technology" both include the participle "recognition", "spoken language recognition" is not determined as a subject word either, which avoids the same participle appearing multiple times in the subject words. The final output subject word set therefore consists of "speech recognition technology" and "neural network model".
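The greedy selection illustrated by fig. 6 can be sketched as below. The text says similarity is determined by edit distance and vector distance; this sketch swaps in a simple Jaccard overlap of participles as a stand-in, and the thresholds (0.4 for similarity, 1 for the preset number of times), the function names and the tokenization into English words are all assumptions chosen so that the example can be traced by hand.

```python
from collections import Counter


def jaccard(a, b):
    """Stand-in similarity: shared participles over all distinct participles."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def select_subject_words(ranked, max_similarity, max_count):
    """Greedily keep phrases, in score order, that stay diverse.

    ranked is a list of phrases (tuples of participles) sorted by phrase
    score from high to low.
    """
    chosen = []
    counts = Counter()  # occurrences of each participle in the chosen set
    for phrase in ranked:
        if any(jaccard(phrase, kept) > max_similarity for kept in chosen):
            continue  # too similar to an existing subject word
        if any(counts[w] + 1 > max_count for w in set(phrase)):
            continue  # a participle would appear too many times
        chosen.append(phrase)
        counts.update(set(phrase))
    return chosen


ranked = [
    ("speech", "recognition", "technology"),
    ("long", "speech", "recognition"),
    ("spoken", "language", "recognition"),
    ("neural", "network", "model"),
]
result = select_subject_words(ranked, max_similarity=0.4, max_count=1)
print([" ".join(p) for p in result])
```

In this trace "long speech recognition" is dropped by the similarity check and "spoken language recognition" by the participle count check, mirroring the two reasons given in the example.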
In one embodiment, an upper limit on the number of subject terms may further be set, so that an excessive number of subject terms does not prevent the user from quickly and accurately grasping the main content of the text to be processed.
In an embodiment, after determining the subject term of the text to be processed, the subject term of the text to be processed may be further output or displayed, or the audio data or the audio/video data corresponding to the text to be processed may be labeled based on the subject term of the text to be processed.
In one embodiment, the subject term of the text to be processed may be sent to the user terminal.
In this embodiment, the boundaries of phrases are determined based on the degree of aggregation and the degree of freedom of the phrases of the text to be processed, and a plurality of candidate phrases are obtained, which ensures the integrity of the subject words. The word relevance of each participle is determined based on the weighted directed graph of the participles of the text to be processed, and the participle score of each participle is calculated by combining multi-dimensional parameters such as word relevance, frequency of occurrence, word coverage, language and inverse document frequency. The phrase score of each candidate phrase is then calculated based on the participle scores and frequencies of occurrence of its participles and on the length and frequency of occurrence of the candidate phrase, and the subject words of the text to be processed are determined based on the phrase scores. This improves the accuracy of subject word determination, so that a user can quickly grasp the main content of the text to be processed, or of the audio and video data corresponding to it, from the subject words.
Fig. 7 is a schematic flowchart of a subject term determination method according to another embodiment of the present application, where for a meeting scenario, as shown in fig. 7, the subject term determination method includes the following steps:
Optionally, for an online conference scenario, the subject term determination method may include:
Step S701, obtaining audio data of a conference collected by a conference system, and generating a text to be processed according to the audio data.
Step S702, a plurality of phrases in the text to be processed are obtained, wherein each phrase comprises at least one word segmentation.
Step S703, for the plurality of phrases, calculates the degree of aggregation of each phrase and the degree of freedom between the plurality of phrases.
Wherein, the degree of aggregation is used for describing the probability of the simultaneous occurrence of each participle in one phrase, and the degree of freedom is used for representing the fixed degree of one phrase and the adjacent phrases.
Step S704, determining the subject term of the conference according to the degree of aggregation of each phrase and the degree of freedom between the plurality of phrases.
In one embodiment, a conferencing system may include one or more user terminals and a processing device. The audio data may be collected by the user terminal, and the subject term determination method may be executed by the processing device. The user can acquire audio data of a conference in a conference site or an online conference process through the user terminal, and then report the audio data to the processing device, so that the processing device executes the subject term determination method provided by the embodiment of the application, and the subject term of the conference is output, so that the user can write a conference summary conveniently, or the user can review conference contents quickly.
In one embodiment, the conferencing system can include a projector, a conferencing device coupled to the projector to present conferencing content, and a processing device. The audio data can be collected by the conference equipment and sent to the processing device so as to realize the extraction of the subject words of the audio data of the conference.
In one embodiment, the conference system may include a server and a plurality of user terminals, and the server is connected to each of the user terminals through a network to implement an online conference. The user can participate in the online conference through conference software installed on the user terminal, and acquires the audio data of the online conference, and then the server or the user terminal executes the subject term determination method provided by the embodiment of the application, so that the subject term extraction of the audio data of the online conference is realized.
Optionally, for an online education scenario, a method for determining a subject term may include: acquiring audio data acquired by an education auxiliary system, and generating a text to be processed according to the audio data; acquiring a plurality of phrases in the text to be processed, wherein each phrase comprises at least one word segmentation; calculating a degree of aggregation of each phrase and a degree of freedom among the phrases, wherein the degree of aggregation is used for describing the probability of the simultaneous occurrence of each participle in one phrase, and the degree of freedom is used for representing the fixed degree of one phrase and the adjacent phrases; and determining subject words of the audio data according to the degree of cohesion of each phrase and the degree of freedom among the plurality of phrases.
By extracting the subject term of the audio data in the course of teaching and adding the subject term to the teaching video, the user can quickly master the main content of the video based on the subject term, so that course selection can be conveniently carried out.
It should be understood that the subject term determination method provided in any one of the embodiments corresponding to fig. 2 to fig. 4 may be used to determine the subject term of the text to be processed that is converted from the audio data in each of the above scenarios; the specific steps and technical effects are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of a subject term determination device according to an embodiment of the present application, and as shown in fig. 8, the subject term determination device according to the embodiment includes: phrase acquisition module 810, phrase parameter calculation module 820, and subject word determination module 830.
The phrase obtaining module 810 is configured to obtain a plurality of phrases in a text to be processed, where each phrase includes at least one word segmentation; a phrase parameter calculating module 820, configured to calculate, for the plurality of phrases, a degree of aggregation of each phrase and a degree of freedom between the plurality of phrases, where the degree of aggregation is used to describe a probability that each participle in one phrase appears simultaneously, and the degree of freedom is used to characterize a fixed degree of one phrase and its neighboring phrases; a topic word determining module 830, configured to determine a topic word of the text to be processed according to the aggregation of each phrase and the degrees of freedom between the plurality of phrases.
Optionally, the apparatus further comprises:
and the text to be processed generation module is used for acquiring audio data of the conference collected by the conference system and generating a text to be processed according to the audio data.
Accordingly, the topic word determination module 830 is configured to:
and determining subject words of the conference according to the degree of aggregation of each phrase and the degrees of freedom among the plurality of phrases.
Optionally, the phrase parameter calculating module 820 includes:
the aggregation degree calculation unit is used for calculating, for an Nth phrase, the degree of aggregation of the Nth phrase according to the frequency of occurrence of the Nth phrase in the text to be processed and the frequency of occurrence of each participle forming the Nth phrase in the text to be processed; a freedom degree calculation unit, configured to calculate, for the Nth phrase, the degree of freedom of the Nth phrase according to the frequency of occurrence of the Nth phrase in the text to be processed and the frequency of occurrence of each stitched phrase of the Nth phrase in the text to be processed, where a stitched phrase of the Nth phrase includes: the Nth phrase, and a phrase adjacent to the Nth phrase; N is a positive integer.
Optionally, the degree of freedom calculation unit is specifically configured to:
respectively calculating first probabilities of the spliced phrases appearing under the condition that the Nth phrase appears according to the frequency of the Nth phrase appearing in the text to be processed and the frequency of each spliced phrase of the Nth phrase appearing in the text to be processed; and determining the degree of freedom of the Nth phrase according to the information entropy of the first probability of each spliced phrase.
Optionally, the topic word determination module 830 includes:
a candidate phrase determining unit, configured to determine at least one candidate phrase from the plurality of phrases and their concatenated phrases according to the degree of aggregation of each phrase and the degree of freedom between the plurality of phrases; and the subject word determining unit is used for determining the subject word of the text to be processed from the at least one candidate phrase according to the characteristics of the candidate phrases, wherein the characteristics of the candidate phrases comprise the frequency of the candidate phrases in the text to be processed and/or the frequency of each participle forming the candidate phrases in the text to be processed.
Optionally, the candidate phrase determining unit is specifically configured to:
screening the plurality of phrases according to the degree of aggregation, and deleting the phrases with the degree of aggregation lower than a preset degree of aggregation; and for each phrase after screening, determining the phrase or the spliced phrase of the phrase as a candidate phrase according to the degree of freedom of the phrase.
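The two screening steps above can be sketched as below. The thresholds, the rule that a low degree of freedom selects the spliced phrase, and the `best_splice` mapping (the most frequent spliced phrase of each phrase) are assumptions made for illustration; the text only states that the choice depends on the degree of freedom.

```python
def select_candidates(phrases, aggregation, freedom, best_splice,
                      min_aggregation=1.0, min_freedom=0.5):
    """Drop loosely aggregated phrases, then keep each phrase or its spliced phrase."""
    candidates = []
    for phrase in phrases:
        if aggregation[phrase] < min_aggregation:
            continue  # below the preset degree of aggregation: deleted
        if freedom[phrase] >= min_freedom:
            candidates.append(phrase)  # free boundary: the phrase stands alone
        else:
            # fixed collocation: assume the spliced phrase is the better unit
            candidates.append(best_splice.get(phrase, phrase))
    return candidates


# Hypothetical scores for three phrases.
aggregation = {("topic", "word"): 2.1, ("word", "is"): 0.2, ("speech", "recognition"): 1.8}
freedom = {("topic", "word"): 1.2, ("word", "is"): 0.9, ("speech", "recognition"): 0.1}
best_splice = {("speech", "recognition"): ("speech", "recognition", "system")}

candidates = select_candidates(list(aggregation), aggregation, freedom, best_splice)
```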
Optionally, the characteristics of the candidate phrases further include participle scores of the participles in the candidate phrases, and the apparatus further includes:
a word segmentation processing module, configured to perform word segmentation on the text to be processed by using a word segmenter to obtain a plurality of clauses, a plurality of participles and the part of speech of each participle; and a participle score calculation module, configured to calculate, for each participle, the participle score of the participle according to one or more of the frequency of occurrence, word coverage, inverse document frequency and language of the participle, wherein the word coverage is the ratio of the number of clauses that include the participle to the total number of clauses of the text to be processed.
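Word coverage as defined above can be computed directly from the segmented clauses. In the sketch below, `word_coverage` follows that definition, while `participle_scores` combines relative frequency and coverage with an equal-weight sum that is purely an assumption; inverse document frequency and language are left out because they need external data.

```python
from collections import Counter


def word_coverage(clauses):
    """Ratio of clauses containing each participle to the total number of clauses."""
    containing = Counter()
    for clause in clauses:
        containing.update(set(clause))  # count a participle once per clause
    return {p: n / len(clauses) for p, n in containing.items()}


def participle_scores(clauses, coverage_weight=0.5):
    """Assumed score: weighted sum of relative frequency and word coverage."""
    frequency = Counter(p for clause in clauses for p in clause)
    total = sum(frequency.values())
    coverage = word_coverage(clauses)
    return {
        p: (1 - coverage_weight) * frequency[p] / total + coverage_weight * coverage[p]
        for p in frequency
    }


# Hypothetical output of a word segmenter: three clauses.
clauses = [["meeting", "topic", "word"], ["topic", "word", "topic"], ["agenda"]]
coverage = word_coverage(clauses)
scores = participle_scores(clauses)
```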
Optionally, the apparatus further comprises:
an association degree calculation module, configured to: after word segmentation is performed on the text to be processed by using a word segmenter to obtain the clauses, the participles and the part of speech of each participle, construct a weighted directed graph of the text to be processed, wherein each node of the weighted directed graph corresponds to one participle of the text to be processed, and the value of each edge represents the frequency of occurrence of the phrase formed by the at least two participles connected by the edge; and determine word relevance among the participles based on the weighted directed graph, wherein the word relevance describes the degree of association between one participle and the other participles;
correspondingly, the participle score calculation module is specifically configured to:
calculate the participle score of the participle according to the word relevance of the participle and one or more of the frequency of occurrence, word coverage, inverse document frequency and language of the participle.
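A minimal sketch of the weighted directed graph is given below, with one node per participle and each edge weighted by how often the two-participle phrase it represents occurs. How word relevance is derived from the graph is not stated here; the sketch assumes a simple measure, namely each participle's share of the total edge weight, and a ranking algorithm such as PageRank could be substituted.

```python
from collections import defaultdict


def build_graph(clauses):
    """Edge a -> b is weighted by the number of times a is immediately followed by b."""
    graph = defaultdict(lambda: defaultdict(int))
    for clause in clauses:
        for a, b in zip(clause, clause[1:]):
            graph[a][b] += 1
    return graph


def word_relevance(graph):
    """Assumed relevance: share of all edge weight touching the participle."""
    strength = defaultdict(int)
    total = 0
    for a, successors in graph.items():
        for b, weight in successors.items():
            strength[a] += weight  # outgoing weight
            strength[b] += weight  # incoming weight
            total += weight
    # Every edge is counted at both endpoints, so the shares sum to 1.
    return {p: s / (2 * total) for p, s in strength.items()}


# Hypothetical segmented clauses.
clauses = [["topic", "word", "extraction"], ["topic", "word", "set"]]
graph = build_graph(clauses)
relevance = word_relevance(graph)
```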
Optionally, the topic word determination unit includes:
the phrase score calculating subunit is used for calculating the phrase score of each candidate phrase according to the occurrence frequency of each participle forming the candidate phrase, the participle score of each participle and the phrase length of the candidate phrase; and the subject word calculation subunit is used for determining the subject words of the text to be processed from the candidate phrases according to the phrase scores of the candidate phrases.
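The phrase score depends on three inputs named above: the frequency of occurrence of each participle, the participle score of each participle, and the phrase length. The combination below (a frequency-weighted mean of participle scores, scaled by a small length bonus) is an assumed formula chosen only to show how the inputs can interact; `length_bonus` is a hypothetical parameter.

```python
def phrase_score(phrase, participle_freq, participle_score, length_bonus=0.1):
    """Assumed score: frequency-weighted mean participle score, boosted by length."""
    total_freq = sum(participle_freq[p] for p in phrase)
    weighted = sum(participle_freq[p] * participle_score[p] for p in phrase)
    return (weighted / total_freq) * (1 + length_bonus * (len(phrase) - 1))


# Hypothetical frequencies and participle scores.
freq = {"topic": 3, "word": 2, "set": 1}
score = {"topic": 0.6, "word": 0.4, "set": 0.2}

candidates = [("word", "set"), ("topic", "word")]
ranked = sorted(candidates, key=lambda ph: phrase_score(ph, freq, score), reverse=True)
```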
Optionally, the subject term calculating subunit is specifically configured to:
determine a topic word set of the text to be processed from the candidate phrases according to the similarity between the candidate phrases, a preset number of times and the phrase score of each candidate phrase, wherein the topic word set comprises at least one topic word, the number of occurrences in the topic word set of each participle forming the topic word set is less than the preset number of times, and the similarity between candidate phrases is determined by the edit distance and the vector distance between the candidate phrases.
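One way to read the selection step above is as a greedy pass over the candidate phrases in descending order of phrase score. The sketch below makes several assumptions: similarity uses only a normalized edit distance (the vector distance is omitted because it needs an embedding model), and `max_similarity`, `preset_times` and `top_k` are hypothetical parameters.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed similarity: 1 minus the normalized edit distance of the joined phrases."""
    sa, sb = " ".join(a), " ".join(b)
    return 1 - edit_distance(sa, sb) / max(len(sa), len(sb), 1)


def select_topic_words(scored, max_similarity=0.7, preset_times=2, top_k=3):
    """Greedily pick high-scoring phrases that are dissimilar and do not overuse a participle."""
    selected, usage = [], Counter()
    for phrase, _ in sorted(scored.items(), key=lambda item: item[1], reverse=True):
        if any(similarity(phrase, s) > max_similarity for s in selected):
            continue  # too close to a topic word already chosen
        needed = Counter(phrase)
        if any(usage[p] + n >= preset_times for p, n in needed.items()):
            continue  # a participle would reach the preset number of times
        selected.append(phrase)
        usage.update(needed)
        if len(selected) == top_k:
            break
    return selected


# Hypothetical phrase scores.
scored = {("topic", "word"): 0.9, ("topic", "words"): 0.8,
          ("topic", "model"): 0.7, ("speech", "recognition"): 0.6}
topic_words = select_topic_words(scored)
```

In this example "topic words" is rejected as a near duplicate of "topic word", and "topic model" is rejected because "topic" would then appear twice in the set.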
The topic word determination apparatus provided in the embodiment of the present application may be used to implement the technical solutions provided in any of the embodiments corresponding to fig. 2 to fig. 4 and fig. 7; its implementation principles and technical effects are similar and are not described again here.
Fig. 9 is a schematic structural diagram of a subject term determination apparatus according to an embodiment of the present application, and as shown in fig. 9, the subject term determination apparatus according to the embodiment includes:
at least one processor 910; and a memory 920 communicatively coupled to the at least one processor; wherein the memory 920 stores computer-executable instructions executable by the at least one processor 910, and the at least one processor 910 executes the computer-executable instructions stored by the memory 920 to cause the subject term determination apparatus to perform the subject term determination method according to any of the embodiments described above.
Optionally, the memory 920 may be separate from the processor 910 or integrated with it.
The implementation principle and technical effect of the subject term determination device provided by this embodiment may be referred to in the foregoing embodiments, and are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium, where a computer-executable instruction is stored in the computer-readable storage medium, and when a processor executes the computer-executable instruction, the method for determining a topic word provided in any one of the foregoing embodiments is implemented.
The present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the topic word determination method provided in any of the foregoing embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a logical division, and other divisions are possible in an actual implementation; for instance, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The memory may comprise a high-speed RAM, and may further comprise a non-volatile memory (NVM), such as at least one magnetic disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.
The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disks or optical disks. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods provided in the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method for topic word determination, the method comprising:
acquiring a plurality of phrases in a text to be processed, wherein each phrase comprises at least one participle;
calculating a degree of aggregation of each phrase and a degree of freedom among the phrases, wherein the degree of aggregation describes the probability that the participles in one phrase occur together, and the degree of freedom represents how fixed the collocation between one phrase and its adjacent phrases is;
and determining the subject term of the text to be processed according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases.
2. The method of claim 1, wherein calculating the degree of aggregation for each phrase and the degrees of freedom between the plurality of phrases comprises:
for an Nth phrase, calculating the degree of aggregation of the Nth phrase according to the frequency of occurrence of the Nth phrase in the text to be processed and the frequency of occurrence of each participle forming the Nth phrase in the text to be processed;
calculating the degree of freedom of the Nth phrase according to the frequency of occurrence of the Nth phrase in the text to be processed and the frequency of occurrence of each spliced phrase of the Nth phrase in the text to be processed, wherein a spliced phrase of the Nth phrase comprises the Nth phrase and a phrase adjacent to the Nth phrase, and N is a positive integer.
3. The method of claim 2, wherein calculating the degree of freedom of the Nth phrase according to the frequency of occurrence of the Nth phrase in the text to be processed and the frequency of occurrence of each spliced phrase of the Nth phrase in the text to be processed comprises:
respectively calculating first probabilities of the spliced phrases appearing under the condition that the Nth phrase appears according to the frequency of the Nth phrase appearing in the text to be processed and the frequency of each spliced phrase of the Nth phrase appearing in the text to be processed;
and determining the degree of freedom of the Nth phrase according to the information entropy of the first probability of each spliced phrase.
4. The method according to claim 2 or 3, wherein determining the subject term of the text to be processed according to the degree of aggregation of each phrase and the degree of freedom between the plurality of phrases comprises:
determining at least one candidate phrase from the plurality of phrases and the spliced phrases thereof according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases;
and determining the subject word of the text to be processed from the at least one candidate phrase according to the characteristics of the candidate phrases, wherein the characteristics of the candidate phrases comprise the frequency of occurrence of the candidate phrases in the text to be processed and/or the frequency of occurrence of each participle forming the candidate phrases in the text to be processed.
5. The method of claim 4, wherein determining at least one candidate phrase from the plurality of phrases and their spliced phrases according to the degree of aggregation of each phrase and the degree of freedom among the plurality of phrases comprises:
screening the plurality of phrases according to the degree of aggregation, and deleting the phrases with the degree of aggregation lower than a preset degree of aggregation;
and for each phrase after screening, determining the phrase or the spliced phrase of the phrase as a candidate phrase according to the degree of freedom of the phrase.
6. The method of claim 4, wherein the characteristics of the candidate phrases further include participle scores of the participles in the candidate phrases, the method further comprising:
performing word segmentation on the text to be processed by using a word segmenter to obtain a plurality of clauses, a plurality of participles and the part of speech of each participle;
and for each participle, calculating a participle score of the participle according to one or more of the frequency of occurrence, word coverage, inverse document frequency and language of the participle, wherein the word coverage is the ratio of the number of clauses that include the participle to the total number of clauses of the text to be processed.
7. The method of claim 6, wherein after performing word segmentation processing on the text to be processed based on a word segmenter to obtain a plurality of clauses, a plurality of participles and a part-of-speech of each participle, the method further comprises:
drawing a weighted directed graph of the text to be processed, wherein each node of the weighted directed graph corresponds to one participle of the text to be processed, and the value of each edge is used for representing the occurrence frequency of a phrase formed by at least two participles connected by the edge;
determining word relevance among the participles based on the weighted directed graph, wherein the word relevance among the participles is used for describing the relevance degree of one participle and other participles;
wherein calculating the participle score of the participle according to one or more of the frequency of occurrence, word coverage, inverse document frequency and language of the participle comprises:
calculating the participle score of the participle according to the word relevance of the participle and one or more of the frequency of occurrence, word coverage, inverse document frequency and language of the participle.
8. The method of claim 6, wherein determining the subject term of the text to be processed from the at least one candidate phrase according to the characteristics of the candidate phrase comprises:
calculating a phrase score of each candidate phrase according to the occurrence frequency of each participle forming the candidate phrase, the participle score of each participle and the phrase length of the candidate phrase;
and determining the subject word of the text to be processed from each candidate phrase according to the phrase score of each candidate phrase.
9. The method of claim 8, wherein determining the subject word of the text to be processed from each candidate phrase according to the phrase score of each candidate phrase comprises:
determining a topic word set of the text to be processed from the candidate phrases according to the similarity between the candidate phrases, a preset number of times and the phrase score of each candidate phrase, wherein the topic word set comprises at least one topic word, the number of occurrences in the topic word set of each participle forming the topic word set is less than the preset number of times, and the similarity between candidate phrases is determined by the edit distance and the vector distance between the candidate phrases.
10. A method for topic word determination, the method comprising:
acquiring audio data of a conference collected by a conference system, and generating a text to be processed according to the audio data;
acquiring a plurality of phrases in the text to be processed, wherein each phrase comprises at least one participle;
calculating a degree of aggregation of each phrase and a degree of freedom among the phrases, wherein the degree of aggregation describes the probability that the participles in one phrase occur together, and the degree of freedom represents how fixed the collocation between one phrase and its adjacent phrases is;
and determining subject words of the conference according to the degree of aggregation of each phrase and the degrees of freedom among the plurality of phrases.
11. A subject word determination device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the subject term determination method of any of claims 1-10.
12. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, perform the method of determining a subject term according to any one of claims 1-10.
CN202210143658.2A 2022-02-17 2022-02-17 Method, device and storage medium for determining subject term Pending CN114186557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210143658.2A CN114186557A (en) 2022-02-17 2022-02-17 Method, device and storage medium for determining subject term

Publications (1)

Publication Number Publication Date
CN114186557A (en) 2022-03-15

Family

ID=80546077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210143658.2A Pending CN114186557A (en) 2022-02-17 2022-02-17 Method, device and storage medium for determining subject term

Country Status (1)

Country Link
CN (1) CN114186557A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077274A (en) * 2014-06-13 2014-10-01 清华大学 Method and device for extracting hot word phrases from document set
CN104298746A (en) * 2014-10-10 2015-01-21 北京大学 Domain literature keyword extracting method based on phrase network diagram sorting
CN108038119A (en) * 2017-11-01 2018-05-15 平安科技(深圳)有限公司 Utilize the method, apparatus and storage medium of new word discovery investment target
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device
US20180366013A1 (en) * 2014-08-28 2018-12-20 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
夏天: "《面向中文学术文本的单文档关键短语抽取》", 《数据分析与知识发现》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination