CN113505597A - Method, device and storage medium for extracting keywords in video conference - Google Patents

Method, device and storage medium for extracting keywords in a video conference

Info

Publication number
CN113505597A
Authority
CN
China
Prior art keywords
keywords
interlocutor
voices
similarity
semantic distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110848123.0A
Other languages
Chinese (zh)
Inventor
李璐
冯文澜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd filed Critical Suirui Technology Group Co Ltd
Priority: CN202110848123.0A
Publication: CN113505597A
Legal status: Pending

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data → G06F40/20 Natural language analysis → G06F40/279 Recognition of textual entities → G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor → G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data → G06F16/48 Retrieval characterised by using metadata → G06F16/483 Retrieval using metadata automatically derived from the content
    • G06F40/00 Handling natural language data → G06F40/20 Natural language analysis → G06F40/205 Parsing → G06F40/216 Parsing using statistical methods

Abstract

The invention discloses a method, a device and a storage medium for extracting keywords in a video conference, belonging to the technical field of video multimedia communication, wherein the method comprises the following steps: S1: acquiring the voices of an interlocutor and a moderator, recognizing the text in the voices, and segmenting the text into words; S2: extracting keywords from the segmented text; S3: extracting keywords from the voices of the interlocutor and the moderator respectively and calculating their similarity by a cosine-similarity method, and if the similarity is not less than a threshold value, performing step S5; S4: extracting keywords from the voices of the interlocutor and the moderator respectively and calculating the semantic distance of the extracted keywords, and if the semantic distance is not greater than a threshold value, performing step S5; wherein the steps S3 and S4 are performed synchronously; S5: displaying the voice content of the interlocutor. Based on the combination of automatic word segmentation, keyword extraction, semantic distance and cosine similarity, the invention can standardize conference discipline and refine the classroom questioning process.

Description

Method, device and storage medium for extracting keywords in video conference
Technical Field
The invention belongs to the technical field of video multimedia communication, and particularly relates to a method and a device for refining keywords in a video conference and a storage medium.
Background
At present, in a video conference, when opinions diverge into dispute, several participants often speak at the same time, so the conference becomes disordered and its order is disturbed. Another phenomenon arises in live teaching: when students run into difficulties and cannot follow, the teacher calls on students by name or reads the questions in the Q&A area one by one according to personal preference, instead of first hearing all the questions and then classifying and answering them scientifically and reasonably. As a result, a student who fails to understand one knowledge point also fails to understand the next, and cannot keep up for the entire lesson.
Currently, in a video conference, the following problems mainly exist in live courses:
During a meeting, while a speaker is still in the middle of explaining and has not finished expressing a point, other participants often interrupt forcibly. This disorders the conference, breaks the participants' train of thought, and can even seriously harm the effectiveness of the meeting.
During live teaching, students' puzzled expressions cannot be noticed by the teacher online, but they can be captured through a camera. In addition, because class time is limited, teachers cannot answer questions only after hearing or reading all of them. With the technique below, the similarity and relatedness of the questions are judged by cosine similarity and semantic distance, which solves this problem encountered in live courses.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a method, a device and a storage medium for extracting keywords in a video conference, which can standardize conference discipline and refine the classroom questioning process based on the combination of automatic word segmentation, keyword extraction, semantic distance and cosine similarity.
In order to achieve the above object, the method for extracting keywords in a video conference provided by the invention comprises the following steps:
s1: acquiring the voices of an interlocutor and a moderator, recognizing the text in the voices, and segmenting the text into words;
s2: extracting keywords from the segmented text;
s3: extracting keywords from the voices of the interlocutor and the host respectively, calculating the similarity of the keywords by adopting a cosine similarity algorithm, judging whether the similarity is smaller than a similarity threshold, and if the similarity is not smaller than the similarity threshold, performing step S5; if the similarity is less than the similarity threshold, terminating the step;
s4: extracting keywords from the voices of the interlocutor and the moderator respectively, calculating the semantic distance of the extracted keywords, and judging whether the semantic distance is greater than a semantic distance threshold value; if the semantic distance is not greater than the semantic distance threshold value, performing step S5, and if it is greater, terminating the step;
wherein, the steps S3 and S4 are synchronously performed;
S5: and displaying the voice content of the interlocutor.
Further, in the step S1, the moderator's voice is taken as the 30 seconds of speech immediately preceding the interlocutor's interruption node.
Further, the step S1 further includes: the camera identifies the expressions of the participants in the conference, judges whether expression changes occur, if so, performs the next step S2, otherwise, repeats the step S1.
Further, in the step S2, a part-of-speech tagging and TF-IDF weighting method are combined to extract the keyword.
Further, the step S2 includes:
s201: removing the null words through part-of-speech tagging;
s202: calculating the weight of the keyword by using a TF-IDF weight method;
s203: vectors of the voices of the interlocutor and the moderator are obtained, respectively.
Further, the step S3 includes: obtaining the vectors of the voices of the interlocutor and the moderator respectively, so as to calculate the cosine similarity.
Further, the step S3 further includes: judging whether the keywords in the voices of the interlocutor and the moderator contain synonyms or antonyms of one another; if the two voices do, adding 1 to the frequency of the corresponding keyword in the word-frequency vector generated for each voice.
Further, the step S4 includes: sorting the keywords in the voices of the interlocutor and the moderator by part of speech, comparing the semantic distance of keywords with the same part of speech, and judging whether the semantic distance is greater than a threshold value;
wherein, the threshold value is set according to the number of the keywords in the voice.
The invention also provides a device for extracting the keywords in the video conference, which is connected with the conference terminal and comprises: the device comprises a storage module, an identification module and a microprocessor;
the storage module is used for storing the voices of the interlocutor and the host;
the recognition module is used for recognizing the voice of the interlocutor and sending the voice to the microprocessor for processing when recognizing the voice of the interlocutor;
the microprocessor is used for processing the voices of the interlocutor and the host by adopting the method for extracting the keywords in the video conference and sending the processing result to the conference terminal.
The present invention also provides a storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above-mentioned method for extracting keywords in a video conference.
The method, the device and the storage medium for extracting the keywords in the video conference have the following beneficial effects that:
1. Using the situation of two voices speaking simultaneously during a meeting as the trigger, the speaker's voice from the preceding 30 seconds is converted into text and compared by cosine similarity; with a similarity of not less than 0.6, or a semantic distance within 3, as the judgment condition, the interlocutor's voice is played scientifically and intelligently during a pause.
2. In the live classroom, questions of students are screened and classified through cosine similarity and semantic distance, classroom problems are simplified and refined, teachers can answer the questions selectively, and teacher and student time is saved.
3. The processing plan can be configured for different use scenes, industries and the like, and the practicability is higher.
4. Automatic word segmentation, keyword extraction, cosine similarity and semantic distance are combined, and whether the conditions for playing the interlocutor's voice are met is analyzed and judged from the cosine similarity and the semantic distance. When the method is applied to a live classroom, questions about confusing knowledge points in the live course can be collected and screened in real time, saving the time of teachers and students.
Drawings
Fig. 1 is a flowchart of a method for refining keywords in a video conference according to this embodiment.
Fig. 2 is a schematic diagram of an apparatus for refining keywords in a video conference according to this embodiment.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand the scheme of the present invention.
As shown in fig. 1, the present invention provides a method for refining keywords in a video conference. The method is triggered when an interlocutor and the moderator speak simultaneously: it recognizes the interlocutor's voice, compares it with the moderator's voice to obtain the similarity and relatedness of the interlocutor's content, and, if they match, plays the interlocutor's voice content when the moderator pauses.
By this method, untimely interruptions of the speaker can be prevented, the interlocutor's voice can be played at a reasonable moment, conference order is maintained, and an intelligent guarantee is provided for a smooth conference. The technique is also suitable for a dual-teacher classroom. During a teacher's live broadcast, the camera collects puzzled expressions; after the teacher is alerted, all questions can be collected by voice or text. Then, by setting a threshold on the semantic distance, the system identifies whether each question is closely related to the knowledge point, i.e., whether it concerns content that the knowledge point requires to be mastered. At the same time, cosine-similarity comparison is performed between the different questions, so that identical or similar questions can be handled uniformly, and the teacher can screen and classify them according to semantic distance and cosine similarity. This processing removes repeated questions and orders the rest by relatedness, saving the time of teachers and students.
The method comprises the following steps:
s1: and acquiring voices of the interlocutor and the host, recognizing characters in the voices, and automatically segmenting the characters.
Wherein the moderator's voice is taken as the 30 seconds of speech immediately preceding the interlocutor's interruption node.
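The 30-second window can be kept as a rolling, timestamped transcript buffer; below is a minimal sketch, in which the transcript structure, names and sample utterances are illustrative assumptions and not taken from the patent:

```python
def recent_text(transcript, now, window=30.0):
    """Join the moderator's utterances from the `window` seconds
    immediately before `now` (the interlocutor's interruption node).

    transcript: list of (timestamp_seconds, text) pairs in time order.
    """
    return " ".join(text for ts, text in transcript
                    if now - window <= ts < now)

# Hypothetical moderator transcript (offsets in seconds).
log = [(0.0, "equity incentive policy"),
       (12.0, "original share price"),
       (28.0, "market trading"),
       (45.0, "next agenda item")]

# Interruption detected at t = 31 s: keep only the preceding 30 s.
print(recent_text(log, 31.0))  # original share price market trading
```

Only the text inside the window is passed on to segmentation and keyword extraction.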
The words of meeting speech are segmented automatically by looking them up in a word table; the longest word in the lookup table is usually set to 8 characters. An example of automatic word segmentation: for the sentence "This equity incentive policy takes effect at the end of the month", the forward-maximum-matching (FMM) result is: this / equity incentive / policy / at / end of month / takes effect.
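Forward maximum matching as described (longest dictionary word first, up to 8 characters, single-character fallback) can be sketched as follows; the vocabulary is a toy dictionary assumed only for this example:

```python
def fmm_segment(text, vocab, max_len=8):
    """Forward maximum matching (FMM): at each position, take the
    longest dictionary word of up to max_len characters; if no
    dictionary word matches, emit a single character and move on."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary for the example sentence above.
vocab = {"这个", "股权激励", "政策", "月末", "生效"}
print("/".join(fmm_segment("这个股权激励政策在月末生效", vocab)))
# 这个/股权激励/政策/在/月末/生效
```

Each position greedily consumes the longest known word, which reproduces the FMM result given in the text.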
When the step S1 is applied to a conference scene such as an online lecture, the method may further include the following step: recognizing the expressions of the participants through a camera and judging whether an expression change occurs (e.g. a puzzled expression during an online lesson); if so, proceeding to the next step S2, and if not, repeating this step.
The expression recognition extracts the puzzled expression through image acquisition, face recognition, preprocessing, micro-expression spotting, feature extraction and micro-expression classification.
S2: and extracting keywords of the words after word segmentation.
The keywords are extracted by combining part-of-speech tagging and the TF-IDF weight method, and the keywords can be extracted more accurately by the method.
The specific steps of step S2 are as follows:
S201: removing the null words through part-of-speech tagging to obtain effective feature vectors as candidate words, wherein the null words include stop words (such as "yes", …), function words and the like.
S202: the weight of the keyword is calculated by using a TF-IDF weight method, which comprises the following steps:
The term frequency TF (term frequency) represents the probability of the word t appearing in the speech d, and is calculated as: TF = (number of times the word t appears in the speech)/(total number of words in the speech);
The inverse document frequency IDF (inverse document frequency) is calculated as: IDF = log(total number of voices in the corpus/(number of voices containing the word t + 1)), which can be simplified to IDF = log(D/Dt), where D is the total number of voices in the corpus and Dt is the number of voices containing the word t plus 1;
The TF-IDF weight value is calculated as: Weight = TF × IDF.
The corpus is a relatively large-scale collection containing many voice documents, some of which contain a given candidate word and some of which do not; the number of documents containing the candidate word is therefore counted, and this count and the total number of voice documents in the corpus are substituted into the formula above to obtain the TF-IDF weight value Weight.
S203: obtaining the vectors of the voices of the interlocutor and the moderator respectively. After the calculation, each keyword corresponds to one coordinate axis, and the value of a voice on each axis is the Weight of the corresponding keyword, so the vector corresponding to speech d is V(d) = {Weight1, Weight2, …, Weightn}. Generally, the keywords are sorted by weight and at most the 5 keywords with the largest weights are selected, i.e. n is at most 5.
In addition, the method can be applied to generating keywords for recorded courses or conference summaries, i.e. keywords can be extracted more accurately by further combining a word-span measure, on the principle that a word whose first and last occurrences are farther apart is more important.
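Steps S201 to S203 can be sketched as follows, assuming stop words have already been removed; the toy corpus and its word lists are invented for illustration and are not from the patent:

```python
import math

def tf_idf_keywords(doc_words, corpus, n=5):
    """Rank the words of one speech by TF-IDF and keep the top n.

    doc_words: segmented words of the speech being scored.
    corpus:    list of word lists, one per speech document
               (doc_words should be one of them).
    """
    total = len(doc_words)
    weights = {}
    for t in set(doc_words):
        tf = doc_words.count(t) / total          # TF = count / total words
        dt = 1 + sum(t in d for d in corpus)     # voices containing t, plus 1
        idf = math.log(len(corpus) / dt)         # IDF = log(D / Dt)
        weights[t] = tf * idf                    # Weight = TF * IDF
    return sorted(weights, key=weights.get, reverse=True)[:n]

# Three hypothetical segmented speeches.
docs = [["equity", "equity", "incentive", "meeting"],
        ["contract", "supplement", "meeting"],
        ["meeting", "minutes"]]
print(tf_idf_keywords(docs[0], docs, n=2))  # ['equity', 'incentive']
```

Words common to every speech (here "meeting") get a low or negative weight, while words distinctive of one speech rise to the top, matching the intent of S202 and S203.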
S3: extracting keywords from the voices of the interlocutor and the host respectively, calculating the similarity of the keywords by adopting a cosine similarity algorithm, judging whether the similarity is smaller than a similarity threshold, and if the similarity is not smaller than the similarity threshold, performing step S5; if the similarity is less than the similarity threshold, the step is terminated.
Through this step, the cosine similarity of the contents of the two voices is computed from their vector features, specifically as follows:
First, keywords are extracted from speeches a and b of the interlocutor and the moderator respectively, to obtain the vectors V(a) = {N1, N2, …, Nn} and V(b) = {N1, N2, …, Nn}, where Ni is the word frequency (number of occurrences) of the corresponding keyword and n is the number of highest-weight keywords selected in step S2. Second, the two keyword sets are merged into one collection to generate the respective word-frequency vectors, and the cosine similarity cos(V(a), V(b)) is calculated; the closer this value is to 1, the higher the similarity. The result is judged against the set similarity threshold: if the similarity reaches the threshold, step S5 is performed to play the interlocutor's voice; if not, this step terminates.
In the invention, the threshold of the similarity is set to 0.6, namely if the similarity of the keywords in the voices of the interlocutor and the moderator is not less than 0.6, the keywords in the voice of the interlocutor are reserved, and if the similarity is less than 0.6, the step is abandoned.
Meanwhile, in this step, a synonym table and an antonym table can be consulted: it is judged whether the keywords in the voices of the interlocutor and the moderator contain synonyms or antonyms of one another. If the two voices do, 1 is added to the frequency of the corresponding keyword in each generated word-frequency vector; otherwise nothing is done. The purpose of this step is not to judge whether the contents or viewpoints of the two voices are identical or similar, but to focus on whether the two voices concern the same subject.
An example of this step S3 is as follows:
The keyword word-frequency vector of speech 1 is: change (1)/contract (1)/lot (1)/supplement (0)/agreement (1); the keyword word-frequency vector of speech 2 is: change (1)/contract (1)/lot (0)/supplement (1)/agreement (1).
These are transformed into the vectors V(1) = {1, 1, 1, 0, 1} and V(2) = {1, 1, 0, 1, 1}.
Then the cosine similarity cos(V(1), V(2)) = 0.75 is calculated. Since the cosine similarity is not less than 0.6, the interlocutor's voice is similar to the moderator's voice, and step S5 may be performed to play the interlocutor's voice.
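A minimal sketch of the cosine-similarity check, computed directly from the binary frequency vectors of this example (the printed values follow from the vectors themselves):

```python
import math

def cosine(u, v):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Axes: change / contract / lot / supplement / agreement
v1 = [1, 1, 1, 0, 1]  # interlocutor (speech 1)
v2 = [1, 1, 0, 1, 1]  # moderator (speech 2)
sim = cosine(v1, v2)  # dot = 3, norms = 2 and 2, so 3 / 4 = 0.75
print(sim >= 0.6)     # True: threshold met, step S5 may proceed
```

The dot product counts shared keywords, so two speeches on the same subject score close to 1 even when a few keywords differ.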
S4: extracting keywords from the voices of the interlocutor and the moderator respectively, calculating the semantic distance of the extracted keywords, judging whether the semantic distance is smaller than a semantic distance threshold value, if the semantic distance is not smaller than the semantic distance threshold value, performing step S5, and if the semantic distance is smaller than the semantic distance threshold value, terminating the step; wherein, steps S3 and S4 are performed synchronously.
For example: "doctor" has very low similarity to "disease" and very high correlation; the correlation between cars and gasoline is very high and the similarity is low, which indicates that the correlation and the similarity are not equal. Word similarity reflects the aggregate characteristics between words, while word relevance reflects the combined characteristics between words.
Specifically, a semantic-distance calculation based on ontology concepts is adopted, mainly using the hypernym-hyponym relations in an ontology. An ontology is an existing framework of semantic associations, in which the semantic distance between some concepts can be read off directly; for example, the semantic distance between "animal" and "plant" is 2, and the semantic distance between "mammal" and "reptile" is 2.
The specific steps are as follows: the keywords in the voices of the interlocutor and the moderator are sorted by part of speech, and keywords with the same part of speech are compared by semantic distance, i.e. the semantic distance Dis(d1, d2) of the two speeches is computed by the semantic-distance formula, where d1 and d2 are keywords with the same part of speech from the voices of the interlocutor and the moderator respectively. Whether Dis(d1, d2) is greater than the threshold is then judged: if not, step S5 is performed; if so, this step terminates. Different thresholds can be set according to the number of keywords; for example, when at most the 5 highest-weight keywords are selected, the threshold is set to 3.
The semantic-distance formula itself is prior art: the distance is obtained directly as the length of the shortest path between the two words in the semantic-association hierarchy.
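A minimal version of the ontology-based distance treats the hypernym/hyponym structure as a graph and takes the shortest-path length; the tiny is-a taxonomy below is an illustrative assumption consistent with the animal/plant example:

```python
from collections import deque

def semantic_distance(graph, a, b):
    """Shortest-path length (edge count) between concepts a and b
    in an undirected hypernym/hyponym graph, via breadth-first search."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # no semantic path between the concepts

# Tiny illustrative is-a taxonomy, stored as an undirected graph.
edges = [("organism", "animal"), ("organism", "plant"),
         ("animal", "mammal"), ("animal", "reptile")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

print(semantic_distance(graph, "animal", "plant"))    # 2
print(semantic_distance(graph, "mammal", "reptile"))  # 2
```

Both example pairs are two edges apart through their common parent, reproducing the distances quoted in the text.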
S5: and displaying the voice content of the interlocutor.
Specifically, through the judgments of steps S3 and S4, if either judgment succeeds, this step S5 is performed: the interlocutor's voice is similar or related to the moderator's voice, and it is played when the moderator pauses; otherwise, the interlocutor's voice is not played.
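The combined S3/S4 decision is a simple OR over the two parallel checks; a sketch using the thresholds stated in the text (similarity not less than 0.6, semantic distance within 3); the function name is illustrative:

```python
def should_play(similarity, distance, sim_threshold=0.6, dist_threshold=3):
    """Steps S3 and S4 run in parallel; if either check passes, the
    interlocutor's voice is played at the moderator's next pause
    (step S5). A smaller semantic distance means the two speeches
    are more closely related."""
    return similarity >= sim_threshold or distance <= dist_threshold

print(should_play(0.75, 5))  # True  (S3 passes: similar topic)
print(should_play(0.20, 2))  # True  (S4 passes: related topic)
print(should_play(0.20, 5))  # False (neither check passes)
```

This matches Example 1 below, where similarity fails but semantic distance succeeds, so the interjection is still played.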
The operation example of the method for extracting the keywords in the video conference is as follows:
example 1:
In a conference convened on equity allocation for the core backbone, a department manager explains the equity incentive policy: employees who have worked at the company for more than 5 years may purchase shares at the original price of 1 yuan per share, and the shares can be traded on the market 5 years after listing. When "market trading" is mentioned, a participant interjects with a question. The interlocutor's voice is compared with the speaker's voice from the 30 seconds before that node: the similarity is less than 0.6, so that judgment condition is not met; the semantic distance of the keywords is then compared and found to be less than 3, so the condition is met, and the interlocutor's voice is played when the department manager pauses.
Example 2:
In a live course, voice detection of the teacher's lecture content is started, keywords are extracted from the voice content, and semantic distances are calculated from the keywords. When the camera detects that 20% of the students show puzzled expressions, an alarm reminds the teacher that the knowledge point has not been understood, and this time node is taken as the trigger condition. The content of the 30 seconds before the detection is: "A project manager is assembling the required team members according to plan when he suddenly receives a notice from the company's personnel department: the company has just signed a 2-year labor agreement with Company A, and from now on all project members must be drawn from Company A. After interviews, the project manager finds that Company A has no personnel who fully meet the requirements. What should the project manager do? C. Search for available resources within Company A and train them; Company A cannot be bypassed, re-recruitment is not allowed, and none of the other options is acceptable." Since this shows that many students do not understand the question, the system raises an alarm and prompts the teacher that 20% of the students have not understood this knowledge point. The teacher then asks, following the prompt, where the students are unclear, and the system collects the students' questions by text or voice. Zhang San: "Can't Company A recruit staff who meet the requirements according to the original plan?"; Wang Er: "Why can't the contract be changed?"; Li Si: "Can't a supplementary agreement be re-signed?"; Zhao Wu: "Can't we recruit again as originally required?" ….
At this moment, the content from the 30 seconds before the puzzled expressions is intercepted and automatically segmented; keywords are extracted, function words and stop words are removed, and at most the 5 keywords with the largest weights are kept. Semantic distances are then calculated for the different questions and sorted: Zhang San (2), Zhao Wu (2), Wang Er (4) and Li Si (5). Comparing the questions by cosine similarity shows that the similarity between Zhang San's and Zhao Wu's questions is 0.8, so the system selects the more concise of the two for display: the one with the shorter speaking time for voice playback, or the one with fewer characters for text display. The questions of Wang Er and Li Si are then listed. The teacher can answer selectively according to the relatedness of the questions, giving priority to questions repeated in a higher proportion and then ordering by semantic distance (i.e. when questions repeat, they are ordered by proportion; when they do not, they are ordered by semantic distance). In this way the students' questions are extracted, simplified and screened by relatedness and similarity, the time of teachers and students is saved, and the live classroom achieves a better learning effect.
Based on the same inventive concept, as shown in fig. 2, an embodiment of the present invention further provides an apparatus for refining a keyword in a video conference, for connecting with a conference terminal, including: a storage module 1, an identification module 2 and a microprocessor 3.
The storage module 1 is used for storing the voices of the interlocutor and the host;
the recognition module 2 is used for recognizing the voice of the interlocutor, and when recognizing the voice of the interlocutor, the recognition module sends the voice to the microprocessor 3 for processing;
the microprocessor 3 is adapted to process the voices of the interlocutor and moderator using the steps S1-S5 in the above-described method of refining keywords in a video conference, and transmit the result of the processing to the conference terminal.
The device for refining keywords in a video conference further comprises a camera 4, which performs expression recognition on the participants in the conference and judges whether puzzled expressions occur; if so, it sends a signal to the microprocessor 3 to process the voices of the interlocutor and the moderator using steps S1-S5 of the above method for refining keywords in a video conference.
Based on the same inventive concept, an embodiment also provides a non-transitory computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of a method for refining keywords in a video conference as described in the above embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The inventive concept is explained in detail herein using specific examples, which are given only to aid in understanding the core concepts of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention are included in the scope of the present invention.

Claims (10)

1. A method for extracting keywords in a video conference, characterized by comprising the following steps:
S1: acquiring the voices of an interlocutor and a moderator, recognizing the text in the voices, and performing word segmentation on the text;
S2: extracting keywords from the segmented text;
S3: extracting keywords from the voices of the interlocutor and the moderator respectively, calculating the similarity of the keywords with a cosine similarity algorithm, and judging whether the similarity is smaller than a similarity threshold; if the similarity is not smaller than the similarity threshold, performing step S5; if the similarity is smaller than the similarity threshold, terminating the step;
S4: extracting keywords from the voices of the interlocutor and the moderator respectively, calculating the semantic distance between the extracted keywords, and judging whether the semantic distance is smaller than a semantic distance threshold; if the semantic distance is not smaller than the semantic distance threshold, performing step S5; if the semantic distance is smaller than the semantic distance threshold, terminating the step;
wherein steps S3 and S4 are performed in parallel;
S5: displaying the voice content of the interlocutor.
2. The method for extracting keywords in a video conference according to claim 1, wherein in step S1, the moderator's voice is taken as the 30 seconds of speech immediately preceding the point at which the interlocutor begins speaking.
3. The method for extracting keywords in a video conference according to claim 1, wherein step S1 further comprises: the camera recognizes the facial expressions of the conference participants and judges whether an expression change occurs; if so, proceeding to step S2; otherwise, repeating step S1.
4. The method for extracting keywords in a video conference according to claim 1, wherein step S2 is performed by combining part-of-speech tagging with TF-IDF weighting.
5. The method for extracting keywords in a video conference according to claim 4, wherein step S2 comprises:
S201: removing function words through part-of-speech tagging;
S202: calculating keyword weights with the TF-IDF weighting method;
S203: obtaining the vectors of the voices of the interlocutor and the moderator, respectively.
6. The method for extracting keywords in a video conference according to claim 5, wherein step S3 comprises: extracting keywords from the voices of the interlocutor and the moderator respectively to obtain the vectors of the voices, from which the cosine similarity is calculated.
7. The method for extracting keywords in a video conference according to claim 6, wherein step S3 further comprises: judging whether the keywords in the voices of the interlocutor and the moderator contain synonyms or antonyms; if both voices do, adding 1 to the frequency of the corresponding keywords in the word-frequency vectors generated for the voices of the interlocutor and the moderator.
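Claim 7's adjustment, incrementing the word frequency of each member of a synonym/antonym pair that spans the two voices, might look like the following. The `related_pairs` lookup table is a stand-in assumption; the patent does not specify how synonyms and antonyms are detected.

```python
from collections import Counter

def adjust_for_related_words(freq_a, freq_b, related_pairs):
    """If a keyword in voice A and a keyword in voice B form a
    synonym or antonym pair, add 1 to each keyword's frequency in
    its own word-frequency vector (per claim 7)."""
    for word_a, word_b in related_pairs:
        if word_a in freq_a and word_b in freq_b:
            freq_a[word_a] += 1
            freq_b[word_b] += 1
    return freq_a, freq_b
```

Boosting both sides of a related pair pulls the two frequency vectors toward each other, which raises the cosine similarity computed in step S3 for semantically related speech.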
8. The method for extracting keywords in a video conference according to claim 1, wherein step S4 comprises: grouping the keywords in the voices of the interlocutor and the moderator by part of speech, comparing the semantic distances between keywords of the same part of speech, and judging whether the semantic distance is greater than the threshold;
wherein the threshold is set according to the number of keywords in the voices.
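One way to read claim 8's grouping-and-comparison step is sketched below. The Euclidean distance over keyword embedding vectors and the rule scaling the threshold by keyword count are both illustrative assumptions; the claim does not fix either choice.

```python
def semantic_distance(vec_a, vec_b):
    """Euclidean distance between two keyword embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)) ** 0.5

def compare_by_pos(interlocutor_kw, moderator_kw, base_threshold=1.0):
    """Compare only keywords of the same part of speech, against a
    threshold that depends on the number of keywords (per claim 8).
    Each keyword maps to a (pos, embedding) pair."""
    threshold = base_threshold / max(len(interlocutor_kw), 1)
    results = {}
    for word, (pos, vec) in interlocutor_kw.items():
        for other, (other_pos, other_vec) in moderator_kw.items():
            if pos == other_pos:
                results[(word, other)] = semantic_distance(vec, other_vec) > threshold
    return results
```

Keywords with different parts of speech are never compared, which avoids meaningless distances between, say, a noun and a verb.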
9. A device for extracting keywords in a video conference, connected to a conference terminal, characterized by comprising: a storage module, a recognition module, and a microprocessor;
the storage module is configured to store the voices of the interlocutor and the moderator;
the recognition module is configured to recognize the voice of the interlocutor and, upon recognizing it, send the voice to the microprocessor for processing;
the microprocessor is configured to process the voices of the interlocutor and the moderator with the method for extracting keywords in a video conference according to any one of claims 1 to 8, and to send the processing result to the conference terminal.
10. A storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method for extracting keywords in a video conference according to any one of claims 1 to 8.
CN202110848123.0A 2021-07-27 2021-07-27 Method, device and storage medium for extracting keywords in video conference Pending CN113505597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110848123.0A CN113505597A (en) 2021-07-27 2021-07-27 Method, device and storage medium for extracting keywords in video conference


Publications (1)

Publication Number Publication Date
CN113505597A 2021-10-15

Family

ID=78014048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110848123.0A Pending CN113505597A (en) 2021-07-27 2021-07-27 Method, device and storage medium for extracting keywords in video conference

Country Status (1)

Country Link
CN (1) CN113505597A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN110300001A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Conference audio control method, system, equipment and computer readable storage medium
CN110674378A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN110853646A (en) * 2019-11-20 2020-02-28 深圳前海微众银行股份有限公司 Method, device and equipment for distinguishing conference speaking roles and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination