CN114697445B - Volume adjusting method, terminal and readable storage medium - Google Patents

Volume adjusting method, terminal and readable storage medium

Info

Publication number
CN114697445B
CN114697445B (application CN202011638242.5A)
Authority
CN
China
Prior art keywords
terminal
audio
received signal
specific space
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011638242.5A
Other languages
Chinese (zh)
Other versions
CN114697445A (en)
Inventor
张敏
杨乐鹏
袁海飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202011638242.5A priority Critical patent/CN114697445B/en
Priority to PCT/CN2021/136096 priority patent/WO2022143040A1/en
Publication of CN114697445A publication Critical patent/CN114697445A/en
Application granted granted Critical
Publication of CN114697445B publication Critical patent/CN114697445B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • H04B17/318Received signal strength
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M1/72454User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/725Cordless telephones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2250/00Details of telephonic subscriber devices
    • H04M2250/12Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application provides a volume adjustment method, an electronic device, a terminal, and a readable storage medium, and belongs to the field of terminal technologies. The method is applied to a terminal and includes the following steps: when the terminal is located in a specific space, the terminal acquires multimedia content played/collected by a device in the specific space; the terminal plays the multimedia content and plays the audio data of the multimedia content at a first volume; the terminal detects whether it has left the specific space; when the terminal is detected to have left the specific space, it continues to play the multimedia content and automatically switches to a second volume, larger than the first volume, for playing the audio data of the multimedia content. By detecting whether the terminal leaves the specific space, the method automatically adjusts the playback volume of the terminal according to the detection result.

Description

Volume adjusting method, terminal and readable storage medium
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a volume adjustment method, an electronic device, a terminal, and a readable storage medium.
Background
Terminal technology is developing rapidly and has been widely adopted. Besides daily communication, users often carry terminals such as mobile phones with them when attending meetings. During a meeting, users typically view and listen to the meeting content through dedicated conference equipment. For convenience, as the technology advances, users can also synchronize the conference content to their terminals for playback.
If a user attends a conference carrying a mobile phone, playing the conference content aloud on the phone in the conference room disturbs the other participants; but if the phone plays the content silently, the user has to adjust the volume manually every time he or she leaves the conference room in order not to miss any content. There is currently no suitable intelligent control strategy for this scenario.
Disclosure of Invention
The embodiments of the application provide a volume adjustment method, an electronic device, a terminal, and a readable storage medium, which solve the problem that a user has to manually adjust the playback volume of a terminal when entering and leaving different environments.
In a first aspect, the present application provides a volume adjustment method applied to a terminal, including: when the terminal is located in a specific space, acquiring multimedia content played/collected by a device in the specific space; playing the audio data of the multimedia content at a first volume; detecting whether the terminal leaves the specific space; and when the terminal is detected to have left the specific space, continuing to play the multimedia content and automatically switching to playing the audio data of the multimedia content at a second volume, where the second volume is larger than the first volume.
By detecting whether the terminal has left the specific space and, when it has, automatically switching to an appropriate volume, the embodiment makes volume adjustment on the terminal more intelligent and avoids the poor experience caused by frequent manual adjustment.
Optionally, the first volume is mute, which prevents the playback from disturbing people in the specific space, such as participants in a conference room or the audience in a cinema; the second volume is the volume used before the terminal entered the specific space, or a frequently used volume setting, so that after the user leaves the specific space the audio content is played at the volume the user is accustomed to.
In one possible implementation, detecting whether the terminal leaves the specific space includes: acquiring second audio data, where the second audio data is the audio currently captured by the terminal; comparing the similarity between first audio data and the second audio data to determine an audio matching degree, where the first audio data is the audio data of the multimedia content; and determining, based at least on the audio matching degree, whether the terminal has left the specific space.
In this embodiment, whether the terminal has left the specific space is judged by checking whether the audio played in the specific space matches the audio currently captured by the terminal, and the terminal is then automatically adjusted to an appropriate volume according to the result, making volume adjustment more intelligent and avoiding frequent manual adjustment.
In one possible implementation, determining whether the terminal leaves the specific space based at least on the audio matching degree is implemented as follows: if the audio matching degree is smaller than a first threshold, it is determined that the terminal has left the specific space.
In one possible implementation, determining whether the terminal leaves the specific space based at least on the audio matching degree is implemented as follows: determining whether the terminal leaves the specific space based at least on the audio matching degree and a received-signal-strength mutation result; the received-signal-strength mutation result characterizes the probability that the received signal strength has changed abruptly, and is determined from multiple groups of received signal strengths, where the multiple groups are the strengths of signals, sent by communication devices in the specific space, that the terminal receives over a continuous period of time.
Using both judgment conditions, the audio matching degree and the received-signal-strength mutation result, makes the judgment of whether the terminal has left the specific space more accurate, and avoids adjusting the volume to the second volume while the terminal is still inside the specific space.
In another possible implementation, determining whether the terminal leaves the specific space based at least on the audio matching degree and the received-signal-strength mutation result is implemented as follows: performing a weighted summation of the received-signal-strength mutation result and the audio matching degree to determine the probability P that the terminal has left the specific space; and if P is greater than or equal to a preset probability value, determining that the terminal has left the specific space.
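The weighted-summation decision above can be sketched in a few lines of Python. The weights, the 0.6 probability cutoff, and the use of the complement of the audio matching degree (a low matching degree should raise the leave probability) are illustrative assumptions, not values fixed by the patent:

```python
def leave_probability(rssi_mutation: float, audio_match: float,
                      w_rssi: float = 0.5, w_audio: float = 0.5) -> float:
    """Weighted fusion of the two cues.

    rssi_mutation: probability in [0, 1] that the received signal
                   strength changed abruptly (higher => more likely leaving).
    audio_match:   similarity in [0, 1] between the room's audio and the
                   audio the terminal currently records (lower => more
                   likely leaving), so its complement is used.
    """
    return w_rssi * rssi_mutation + w_audio * (1.0 - audio_match)


def has_left(rssi_mutation: float, audio_match: float,
             p_threshold: float = 0.6) -> bool:
    """True when the fused probability P reaches the preset value."""
    return leave_probability(rssi_mutation, audio_match) >= p_threshold
```

With equal weights, a strong RSSI mutation (0.9) combined with a low audio match (0.1) yields P = 0.9, well above the cutoff.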
In another possible implementation, determining whether the terminal leaves the specific space based at least on the audio matching degree is implemented as follows: determining whether the terminal leaves the specific space based on the audio matching degree, the received-signal-strength mutation result, and an image recognition result; the received-signal-strength mutation result characterizes the probability that the received signal strength has changed abruptly, and is determined from multiple groups of received signal strengths, where the multiple groups are the strengths of signals, sent by communication devices in the specific space, that the terminal receives over a continuous period of time; the image recognition result characterizes the probability that the terminal is located in the specific space, the probability being obtained based on an image of a target object and an image recognition model, where the image recognition model judges, from the image of the target object, the probability that the terminal is located in the specific space.
Adding the image recognition result as a further judgment condition, so that whether the terminal has left the specific space is judged from three results, the audio matching degree, the received-signal-strength mutation result, and the image recognition result, further improves the judgment accuracy and the user experience.
In another possible implementation, determining whether the terminal leaves the specific space based on the audio matching degree, the received-signal-strength mutation result, and the image recognition result is implemented as follows: the terminal is judged to have left the specific space when the audio matching degree is smaller than a first threshold, the received-signal-strength mutation result is larger than a second threshold, and the image recognition result is smaller than a third threshold; or the audio matching degree is smaller than the first threshold and the received-signal-strength mutation result is larger than the second threshold; or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold; or the received-signal-strength mutation result is larger than the second threshold and the image recognition result is smaller than the third threshold. In other words, at least two of the three conditions must indicate departure.
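The four enumerated cases amount to requiring that at least two of the three cues indicate departure. A minimal Python sketch, with hypothetical threshold values (t1, t2, t3 stand in for the first, second, and third thresholds):

```python
def left_space(audio_match: float, rssi_mutation: float, image_prob: float,
               t1: float = 0.5, t2: float = 0.7, t3: float = 0.5) -> bool:
    """At least two of the three cues must indicate departure."""
    cues = [audio_match < t1,    # room audio no longer matches what is heard
            rssi_mutation > t2,  # abrupt change in received signal strength
            image_prob < t3]     # camera no longer recognizes the space
    return sum(cues) >= 2
```

Any single dissenting cue is outvoted, which is exactly the robustness the enumeration of pairs and the triple achieves.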
In another possible implementation, comparing the similarity between the first audio data and the second audio data to determine the audio matching degree includes: acquiring the second audio data; aligning the first audio data and the second audio data; extracting N consecutive audio frames of the first audio data within a preset time period to obtain a first audio frame sequence; extracting the N audio frames of the second audio data aligned with the first audio frame sequence to obtain a second audio frame sequence, where N is a positive integer greater than or equal to 1; calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence; and determining the audio matching degree of the first audio data and the second audio data within the preset time period based on the ratio of the number of qualified audio frames in the second audio frame sequence to N, where a qualified audio frame is an audio frame in the second audio frame sequence whose similarity to the corresponding audio frame in the first audio frame sequence is larger than a fourth threshold.
In another possible implementation, calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence is implemented as follows: extracting a first feature vector characterizing each audio frame in the first audio frame sequence and a second feature vector characterizing each audio frame in the second audio frame sequence; and determining the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence based on the similarity of the first and second feature vectors.
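As an illustration of the two steps above, the sketch below uses cosine similarity between per-frame feature vectors and then counts the fraction of frames whose similarity exceeds the (fourth) threshold. The feature extraction itself (e.g. MFCCs) is left out, and the function names and the 0.8 threshold are assumptions:

```python
import numpy as np


def frame_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity between two per-frame feature vectors."""
    denom = np.linalg.norm(f1) * np.linalg.norm(f2)
    return float(np.dot(f1, f2) / denom) if denom else 0.0


def audio_matching_degree(seq1, seq2, fourth_threshold: float = 0.8) -> float:
    """Fraction of aligned frames whose similarity exceeds the threshold."""
    assert len(seq1) == len(seq2)
    qualified = sum(1 for a, b in zip(seq1, seq2)
                    if frame_similarity(a, b) > fourth_threshold)
    return qualified / len(seq1)
```

Counting qualified frames rather than averaging raw similarities makes the matching degree robust to a few noisy frames (e.g. a door slamming during capture).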
In another possible implementation, aligning the first audio data and the second audio data is implemented as follows: extracting M consecutive audio frames starting from a first moment in the second audio data to obtain a third audio frame sequence, and extracting M consecutive audio frames starting from each of a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, where each second moment is greater than or equal to the first moment and M is a positive integer greater than or equal to 1; determining a delay compensation parameter based on the similarity between the third audio frame sequence and each fourth audio frame sequence; and aligning the first audio data and the second audio data based on the delay compensation parameter.
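A minimal sketch of this delay-compensation search: slide a fixed window taken from the captured audio over candidate offsets in the reference audio and keep the offset with the highest normalized correlation. The window and search-range sizes are hypothetical, and the per-sample formulation stands in for the patent's per-frame one:

```python
import numpy as np


def estimate_delay(ref: np.ndarray, probe: np.ndarray,
                   window: int, max_delay: int) -> int:
    """Return the offset of `probe` inside `ref` (in samples) that
    maximizes normalized correlation over a fixed window."""
    seg = probe[:window]
    best_offset, best_score = 0, -np.inf
    for d in range(max_delay + 1):
        cand = ref[d:d + window]
        if len(cand) < window:
            break
        denom = np.linalg.norm(seg) * np.linalg.norm(cand)
        score = float(np.dot(seg, cand) / denom) if denom else 0.0
        if score > best_score:
            best_offset, best_score = d, score
    return best_offset
```

The returned offset is the delay compensation parameter: shifting the reference by it lines the two streams up for the frame-by-frame comparison.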
In another possible implementation, the method further includes: acquiring multiple groups of received signal strengths over a continuous period of time; and determining the received-signal-strength mutation result based on the multiple groups of received signal strengths.
Determining the mutation result from multiple groups of received signal strengths prevents a single anomalous reading from affecting its accuracy.
In another possible implementation, determining the received-signal-strength mutation result based on the multiple groups of received signal strengths is implemented as follows: determining a feature vector characterizing the variation of the received signal strength based on the differences between the received signal strengths at adjacent moments; and inputting the feature vector into a preset prediction model to determine the received-signal-strength mutation result.
In another possible implementation, determining the feature vector characterizing the variation of the received signal strength based on the differences between the received signal strengths at adjacent moments is implemented as follows: determining the signal-strength mutation feature vector based on whether the difference between the signal strengths at adjacent moments is greater than a fifth threshold.
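A sketch of building the mutation feature vector from adjacent-moment differences. The simple averaging stand-in below replaces the patent's preset prediction model (which could be any trained classifier), and the 8 dB fifth threshold is an assumption:

```python
def rssi_change_features(rssi_series, fifth_threshold: float = 8.0):
    """Binary feature vector: 1 where |RSSI[t+1] - RSSI[t]| exceeds
    the (fifth) threshold, else 0."""
    return [1 if abs(b - a) > fifth_threshold else 0
            for a, b in zip(rssi_series, rssi_series[1:])]


def mutation_score(features) -> float:
    """Stand-in for the prediction model: fraction of abrupt steps."""
    return sum(features) / len(features) if features else 0.0
```

For a series such as [-50, -51, -50, -70, -72] dBm, only the -50 to -70 step exceeds 8 dB, so the feature vector flags exactly one mutation.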
In another possible implementation, the multiple groups of received signal strengths are multiple groups of X Bluetooth received signal strengths, where these are the strengths of Bluetooth signals, sent by X Bluetooth devices in the specific space, that the terminal receives over a continuous period of time, and X is a positive integer greater than or equal to 3; one group among the multiple groups consists of the Bluetooth signal strengths sent by the X Bluetooth devices in the specific space and received by the terminal at the same moment.
By using the Bluetooth signal strengths already present in the specific space as the received signal strengths, no Bluetooth device needs to be deployed in advance, which saves deployment cost and widens the range of application.
In another possible implementation, when the terminal is detected to have left the specific space, continuing to play the multimedia content and automatically switching to the second volume is implemented as follows: if the terminal is detected to have left the specific space, the front camera of the terminal is turned on; it is then judged whether both of the following conditions are met: the front camera continuously captures the face image of the target user for a preset duration, and the page displayed on the terminal is a designated display page; if so, the terminal automatically switches to the second volume to play the audio data of the multimedia content. In this way the audio content is played aloud only when the user actually needs it, which is more user-friendly.
In another possible implementation, the method further includes: if the terminal is detected to have left the specific space, collecting a first signal fingerprint, where the first signal fingerprint is determined from the received signal strengths collected by the terminal at its current position; then collecting second signal fingerprints at a preset frequency, and when a second signal fingerprint matches the first signal fingerprint, adjusting the volume of the terminal to a third volume, where the third volume is larger than the first volume and smaller than the second volume. In this way, when the user is about to re-enter the specific space (such as a conference room or a cinema), the volume is lowered in advance so as not to disturb the people inside.
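Fingerprint matching might be sketched as a Euclidean distance between per-beacon RSSI vectors; the distance metric, the 5 dB matching threshold, and the function names are assumptions, since the patent does not fix how fingerprints are compared:

```python
import math


def fingerprint_distance(fp1, fp2) -> float:
    """Euclidean distance between two RSSI fingerprints, one value per
    beacon (the same beacon ordering is assumed in both)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp1, fp2)))


def approaching_space(saved_fp, current_fp,
                      match_threshold: float = 5.0) -> bool:
    """True when the current fingerprint matches the one saved near the
    doorway, i.e. the user is about to re-enter the space."""
    return fingerprint_distance(saved_fp, current_fp) <= match_threshold
```

The saved fingerprint acts as a proximity marker for the doorway: once the live readings come close to it again, the terminal drops to the third volume before the user steps inside.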
Optionally, the third volume is a relatively low volume; for example, when the maximum volume of the terminal is 100, the third volume is 10.
In another possible implementation, the method further includes: if the terminal is detected to have left the specific space, continuing to detect whether the terminal re-enters the specific space; and if the terminal is detected to have entered the specific space, automatically adjusting the volume of the terminal back to the first volume for playing the audio data of the multimedia content.
In another possible implementation, the multimedia content further includes the display content of the devices in the specific space.
In another possible implementation, the method further includes: adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received-signal-strength mutation result, and the image recognition result.
In another possible implementation, adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received-signal-strength mutation result, and the image recognition result is implemented as follows: when the image recognition result is smaller than the third threshold and the received-signal-strength mutation result is larger than the second threshold, but the audio matching degree is greater than or equal to the first threshold; or when the image recognition result is greater than or equal to the third threshold and the received-signal-strength mutation result is less than or equal to the second threshold, but the audio matching degree is smaller than the first threshold; that is, when the audio judgment contradicts the other two judgments, the delay compensation parameter and the fourth threshold are adjusted.
When the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold, but the received-signal-strength mutation result is less than or equal to the second threshold; or when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, but the received-signal-strength mutation result is larger than the second threshold; that is, when the received-signal judgment contradicts the other two judgments, the fifth threshold is adjusted. In this way the decision parameters of the audio matching degree and of the received-signal-strength mutation result learn from each other and are adjusted dynamically, improving the accuracy of judging whether the terminal has left the specific space.
In another possible implementation, adjusting the delay compensation parameter and the fourth threshold in the above cases is implemented as follows: when the image recognition result is smaller than the third threshold and the received-signal-strength mutation result is larger than the second threshold, but the audio matching degree is greater than or equal to the first threshold, the fourth threshold is adjusted until the audio matching degree falls below the first threshold; when the image recognition result is greater than or equal to the third threshold and the received-signal-strength mutation result is less than or equal to the second threshold, but the audio matching degree is smaller than the first threshold, it is judged whether the audio similarity change rate is larger than a preset threshold; if so, the delay compensation parameter is adjusted until the audio similarity change rate is smaller than the preset threshold; if not, the fourth threshold is lowered until the audio matching degree is greater than or equal to the first threshold.
The audio similarity change rate is determined as the ratio of the difference between the similarity of the audio frames of the first and second audio data at a second moment and their similarity at a third moment, to the difference between the second moment and the third moment; where the second moment is a moment at which the image recognition result is greater than or equal to the third threshold, the received-signal-strength mutation result is less than or equal to the second threshold, and the audio matching degree is smaller than the first threshold, and the third moment is a moment adjacent to the second moment.
In another possible implementation, adjusting the fifth threshold in the above cases is implemented as follows: when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold, but the received-signal-strength mutation result is less than or equal to the second threshold, the fifth threshold is decreased until the received-signal-strength mutation result becomes larger than the second threshold; when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, but the received-signal-strength mutation result is larger than the second threshold, the fifth threshold is increased until the received-signal-strength mutation result becomes less than or equal to the second threshold.
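The two adjustment directions for the fifth threshold can be sketched as follows. The single additive nudge (rather than the patent's iterate-until-condition loop) and the step size of 2 dB are simplifications for illustration:

```python
def adapt_fifth_threshold(image_says_left: bool, audio_says_left: bool,
                          rssi_says_left: bool, fifth: float,
                          step: float = 2.0) -> float:
    """Nudge the adjacent-difference (fifth) threshold when the RSSI cue
    dissents from the other two cues; otherwise leave it unchanged."""
    if image_says_left and audio_says_left and not rssi_says_left:
        # RSSI missed a departure: lower the threshold, more sensitive.
        return fifth - step
    if not image_says_left and not audio_says_left and rssi_says_left:
        # RSSI raised a false alarm: raise the threshold, less sensitive.
        return fifth + step
    return fifth
```

Lowering the threshold makes more adjacent-moment differences count as abrupt, raising the mutation result, and raising it does the opposite, which is exactly the correction each dissent case calls for.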
In a second aspect, the present application further provides an electronic device, including: an obtaining module, configured to acquire, when the terminal is located in a specific space, multimedia content played/collected by a device in the specific space; a playing module, configured to play the multimedia content and play the audio data of the multimedia content at a first volume; a detection module, configured to detect whether the terminal leaves the specific space; and an adjustment module, configured to, when the terminal is detected to have left the specific space, continue playing the multimedia content and automatically switch to playing the audio data of the multimedia content at a second volume, where the second volume is larger than the first volume.
In one possible implementation, the obtaining module is further configured to acquire second audio data, where the second audio data is the audio currently captured by the terminal; and the detection module is further configured to: compare the similarity between first audio data and the second audio data to determine an audio matching degree, where the first audio data is the audio data of the multimedia content; and determine, based at least on the audio matching degree, whether the terminal has left the specific space.
In one possible implementation, determining whether the terminal leaves the specific space based at least on the audio matching degree includes: if the audio matching degree is smaller than a first threshold, determining that the terminal has left the specific space.
In another possible implementation, determining whether the terminal leaves the specific space based at least on the audio matching degree includes: determining whether the terminal leaves the specific space based at least on the audio matching degree and a received-signal-strength mutation result; the received-signal-strength mutation result characterizes the probability that the received signal strength has changed abruptly, and is determined from multiple groups of received signal strengths, where the multiple groups are the strengths of signals, sent by communication devices in the specific space, that the terminal receives over a continuous period of time.
In another possible implementation, the determining whether the terminal leaves the specific space based at least on the audio matching degree and the received-signal-strength abrupt-change result includes: performing a weighted summation of the abrupt-change result and the audio matching degree to determine a probability P that the terminal has left the specific space; and if P is greater than or equal to a preset probability value, determining that the terminal has left the specific space.
In another possible implementation, the determining whether the terminal leaves the specific space based at least on the audio matching degree includes: determining whether the terminal leaves the specific space based on the audio matching degree, the received-signal-strength abrupt-change result, and an image recognition result. The abrupt-change result characterizes the probability that the received signal strength has changed abruptly, and is determined based on multiple groups of received signal strengths, where the multiple groups of received signal strengths are the strengths of signals, sent by communication devices in the specific space, that the terminal receives over a continuous period of time. The image recognition result characterizes the probability that the terminal is located in the specific space; that probability is obtained based on an image of a target object and an image recognition model, where the image recognition model judges, based on recognition of the image of the target object, the probability that the terminal is located in the specific space.
In another possible implementation, the determining whether the terminal leaves the specific space based on the audio matching degree, the received-signal-strength abrupt-change result, and the image recognition result includes determining that the terminal has left the specific space when any of the following holds: the audio matching degree is less than a first threshold, the abrupt-change result is greater than a second threshold, and the image recognition result is less than a third threshold; or the audio matching degree is less than the first threshold and the abrupt-change result is greater than the second threshold; or the audio matching degree is less than the first threshold and the image recognition result is less than the third threshold; or the abrupt-change result is greater than the second threshold and the image recognition result is less than the third threshold.
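The conditions above amount to requiring that at least two of the three per-signal tests indicate a departure. A minimal illustrative sketch (function name and threshold values are assumptions for illustration, not part of the present application):

```python
# Illustrative sketch of the "at least two of three conditions" decision
# described above. Thresholds t1/t2/t3 stand in for the first, second, and
# third thresholds; their values here are arbitrary examples.

def has_left_space(audio_match, rssi_change, image_result,
                   t1=0.5, t2=0.7, t3=0.4):
    """Return True when at least two of the three per-signal
    conditions indicate that the terminal left the specific space."""
    votes = [
        audio_match < t1,    # audio matching degree below first threshold
        rssi_change > t2,    # RSSI abrupt-change result above second threshold
        image_result < t3,   # image recognition result below third threshold
    ]
    return sum(votes) >= 2
```

Any single condition alone is not sufficient; only a combination of two or more triggers the departure decision.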
In another possible implementation, the detection module is further configured to: align the first audio data and the second audio data; extract N consecutive audio frames of the first audio data within a preset time period to obtain a first audio frame sequence; extract the N audio frames of the second audio data aligned with the first audio frame sequence to obtain a second audio frame sequence, where N is a positive integer greater than or equal to 1; calculate the similarity between each frame in the first audio frame sequence and the corresponding frame in the second audio frame sequence; and determine the audio matching degree of the first audio data and the second audio data within the preset time period based on the ratio of the number of qualifying audio frames in the second audio frame sequence to N, where a qualifying audio frame is a frame in the second audio frame sequence whose similarity to the corresponding frame in the first audio frame sequence is greater than a fourth threshold.
In another possible implementation, the calculating the similarity between each frame of the first audio frame sequence and the corresponding audio frame of the second audio frame sequence includes: respectively extracting and obtaining a first characteristic vector representing each frame of audio frames in the first audio frame sequence and a second characteristic vector representing each frame of audio frames in the second audio frame sequence; and determining the similarity of each frame of audio frames in the first audio frame sequence and the corresponding audio frames in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
In another possible implementation, the aligning the first audio data and the second audio data includes: extracting a continuous M-frame audio frame sequence from a first moment in the second audio data to obtain a third audio frame sequence, and extracting a continuous M-frame audio frame sequence from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1; determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence; the first audio data and the second audio data are aligned based on the delay compensation parameter.
In another possible implementation, the detection module is further configured to obtain multiple sets of received signal strengths for a continuous time; and determining the received signal strength mutation result based on the multiple groups of received signal strengths.
In another possible implementation, the determining the received signal strength mutation result based on the multiple sets of received signal strengths includes: determining a feature vector representing the change feature of the received signal strength based on the difference value of the received signal strengths at adjacent moments; and inputting the feature vector into a preset prediction model, and determining the received signal strength mutation result.
In another possible implementation, the plurality of sets of received signal strengths are a plurality of sets of X bluetooth received signal strengths, where the plurality of sets of X bluetooth received signal strengths are bluetooth signal strengths sent by X bluetooth devices in the specific space received by the terminal in a continuous time, and X is a positive integer greater than or equal to 3; and the received signal intensity of one group of the plurality of groups of received signal intensities is the Bluetooth signal intensity transmitted by X Bluetooth devices in the specific space received by the terminal at the same time.
In another possible implementation, the adjusting module is further configured to: if it is detected that the terminal has left the specific space, control the front camera of the terminal to turn on; determine whether the face image of the target user has been captured by the front camera for a preset duration and whether the display page of the terminal is a designated display page; and if so, automatically adjust the terminal to play the audio data of the multimedia content at the second volume.
In another possible implementation, the apparatus further includes a collection module configured to: when it is determined that the terminal has left the specific space, collect a first signal fingerprint, where the first signal fingerprint is determined based on the received signal strengths collected by the terminal at its current position; and subsequently collect a plurality of second signal fingerprints at a preset frequency. The adjusting module is further configured to: match the second signal fingerprints against the first signal fingerprint; and adjust the volume of the terminal to a third volume, where the third volume is greater than the first volume and less than the second volume.
In another possible implementation, the detection module is further configured to: if the terminal is detected to leave the specific space, continuing to detect whether the terminal enters the specific space or not; and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
In another possible implementation, the multimedia content further includes display content of devices in the particular space.
In another possible implementation, the apparatus further includes a parameter optimization module configured to adjust a fourth threshold, a delay compensation parameter, and a fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result.
In another possible implementation, adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received-signal-strength abrupt-change result, and the image recognition result includes: when the image recognition result is less than the third threshold and the abrupt-change result is greater than the second threshold, but the audio matching degree is greater than or equal to the first threshold; or when the image recognition result is greater than or equal to the third threshold and the abrupt-change result is less than or equal to the second threshold, but the audio matching degree is less than the first threshold; adjusting the delay compensation parameter and the fourth threshold.

When the image recognition result is less than the third threshold and the audio matching degree is less than the first threshold, but the abrupt-change result is less than or equal to the second threshold; or when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, but the abrupt-change result is greater than the second threshold; adjusting the fifth threshold. In this way, the decision parameters of the audio matching degree and the decision parameters of the received-signal-strength abrupt-change result learn from each other and are adjusted dynamically, improving the accuracy of judging whether the terminal has left the specific space.
In another possible implementation, the adjusting of the delay compensation parameter and the fourth threshold (that is, when the image recognition result is less than the third threshold and the abrupt-change result is greater than the second threshold but the audio matching degree is greater than or equal to the first threshold, or when the image recognition result is greater than or equal to the third threshold and the abrupt-change result is less than or equal to the second threshold but the audio matching degree is less than the first threshold) includes: when the image recognition result is less than the third threshold and the abrupt-change result is greater than the second threshold but the audio matching degree is greater than or equal to the first threshold, adjusting the fourth threshold until the audio matching degree is less than the first threshold;

when the image recognition result is greater than or equal to the third threshold and the abrupt-change result is less than or equal to the second threshold but the audio matching degree is less than the first threshold, judging whether the audio similarity change rate is greater than a preset threshold; if so, adjusting the delay compensation parameter until the audio similarity change rate is less than the preset threshold; if not, decreasing the fourth threshold until the audio matching degree is greater than or equal to the first threshold.
The audio similarity change rate is determined as the ratio of (a) the difference between the similarity of the audio frames of the first audio data and the second audio data at a second moment and their similarity at a third moment to (b) the difference between the second moment and the third moment, where the second moment is a moment at which the image recognition result is greater than or equal to the third threshold and the abrupt-change result is less than or equal to the second threshold but the audio matching degree is less than the first threshold, and the third moment is a moment adjacent to the second moment.
In another possible implementation, the adjusting of the fifth threshold includes: when the image recognition result is less than the third threshold and the audio matching degree is less than the first threshold but the abrupt-change result is less than or equal to the second threshold, decreasing the fifth threshold until the abrupt-change result is greater than the second threshold; and when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold but the abrupt-change result is greater than the second threshold, increasing the fifth threshold until the abrupt-change result is less than or equal to the second threshold.
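The fifth-threshold adjustment above can be sketched as follows. This is an illustrative sketch only: the function name, step size, and threshold values are assumptions, and the actual adjustment magnitude is not specified in the present application.

```python
# Illustrative sketch of the fifth-threshold (RSSI fluctuation allowance)
# adjustment: when the image and audio signals agree on a verdict but the
# RSSI abrupt-change result disagrees, nudge the allowance Δj so the RSSI
# branch learns to agree. Step size is an arbitrary example.

def adjust_fifth_threshold(image_result, audio_match, rssi_change,
                           delta, t1, t2, t3, step=1.0):
    left_by_others = image_result < t3 and audio_match < t1
    stayed_by_others = image_result >= t3 and audio_match >= t1
    if left_by_others and rssi_change <= t2:
        return delta - step   # decrease allowance: RSSI branch more sensitive
    if stayed_by_others and rssi_change > t2:
        return delta + step   # increase allowance: RSSI branch less sensitive
    return delta              # no disagreement pattern: leave unchanged
```

In practice the adjustment would be repeated until the abrupt-change result crosses the second threshold, as the text describes.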
In a third aspect, the present application further provides a terminal, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method as described in the first aspect or any of the possible implementations of the first aspect.
In a fifth aspect, the application also provides a computer program or computer program product comprising instructions which, when executed, implement the method as described in the first aspect or any one of the possible implementations of the first aspect.
Based on the implementations provided in the above aspects, further combinations may be made in the present application to provide further implementations.
Drawings
FIG. 1 is an application scenario diagram of an embodiment of the present application;

FIG. 2a is a flowchart of a volume adjustment method according to an embodiment of the present application;

FIG. 2b is a flowchart of a volume adjustment method according to another embodiment;

FIG. 3 is a flowchart of a method for detecting whether a terminal leaves a conference room according to an embodiment of the present application;

FIG. 4 is a flowchart of audio matching determination;

FIG. 5 is a schematic diagram of MFCC feature extraction from audio data;

FIG. 6 is a schematic diagram of the alignment process of the first audio data and the second audio data;

FIG. 7 is a schematic diagram of RSSI acquisition;

FIG. 8 is a flowchart for determining the RSSI abrupt-change result;

FIG. 9 is a flowchart of a volume adjustment method in another embodiment;

FIG. 10 is a schematic diagram of parameter adjustment of the audio matching degree in another embodiment;

FIG. 11 is a schematic diagram of parameter adjustment of the received-signal-strength abrupt-change result according to another embodiment;

FIG. 12 is a diagram of another embodiment of the present application;

FIG. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The technical scheme of the application is further described in detail through the drawings and the embodiments.
The volume adjustment method and terminal provided by the embodiments of the present application can be applied to scenarios in which a user is in a specific space where the terminal is required to be silent, such as a meeting room or a movie theater. The scheme of the embodiments of the present application is described in detail below taking a meeting-room scenario as an example.
Fig. 1 is an application scenario diagram of an embodiment of the present application. As shown in fig. 1, when a user is attending a conference with multiple participants and needs to leave the conference room briefly (for example, to go to the restroom) without missing the conference content, the terminal carried by the user can establish a connection with the conference device 1 so that the conference content is synchronized to the user's terminal for playback. So as not to disturb the other participants, the conference content is first played muted; once the user has left the conference room, the volume is automatically adjusted to normal to play the conference content. In this way, the user does not miss important conference content, and the terminal intelligently adjusts its volume according to the scenario, causing no audio interference to the other participants and improving the user experience.
Fig. 2a is a flowchart of a volume adjustment method according to an embodiment of the present application. As shown in fig. 2a, the method comprises the following steps:
s201, conference content data at least comprising conference audio data is acquired, and the conference audio data is played at a first volume.
When the user needs to leave the conference room (i.e., a specific space) temporarily, in order not to miss conference content, the user acquires conference content data (i.e., multimedia data) including at least conference audio data through the terminal and continues playing the acquired conference audio data at a first volume on the terminal.
It will be appreciated that there are various ways for the terminal to establish a communication connection with the conference device to obtain the conference content data. For example, when the conference room has a conference auxiliary device that collects conference content data and synchronizes it to a specific terminal, the terminal may establish a connection with the auxiliary device by a tap-to-connect ("bump") gesture so that the conference content is synchronized to the terminal for playback. Alternatively, the terminal may establish the connection by scanning a two-dimensional (QR) code, or by receiving an instruction to establish a communication connection with a specific conference device, and so on. Any available acquisition method may be adopted according to the actual situation, and the application is not limited in this respect.
It should be explained that conference audio data includes audio data related to a conference, such as audio data played by a conference device, audio data generated by a discussion of the speech of conference participants, audio data generated by other conference devices (e.g., remote speech of remote conference participants played through speakers), etc.
In one example, in order not to affect the meeting personnel, the first volume is silent, or very small, e.g. the maximum volume is 100, the first volume is between 1-5, i.e. the first volume is preferably not perceived by the person at a close distance.
The terminal may be any terminal with audio recording and playback capabilities, such as a smartphone, a smart wearable device, a tablet computer, a notebook computer, a palmtop computer, or a personal digital assistant.
The conference device may be a device related to a conference such as a display device having a communication function (e.g., a large screen display, displaying and playing conference content data), or a sound pickup device (e.g., a microphone, picking up audio data generated by the speech of conference participants).
S202, detecting whether the terminal leaves the conference room.
When the user acquires conference content data via the terminal (i.e. when the terminal establishes a communication connection with the conference device), it starts to detect whether or not it leaves the conference room itself. For example, when the terminal establishes a communication connection with the conference auxiliary device in a "bump-and-bump" manner, or when the terminal scans the two-dimensional code to establish a communication connection with the conference device, or when the terminal receives an instruction to establish a communication connection with the conference device, the terminal starts a built-in detection algorithm to detect whether the terminal leaves the conference room. See below for a specific scheme for detecting whether a terminal leaves a conference room.
S203, when it is detected that the terminal has left the conference room, automatically adjusting the terminal to play the audio data of the conference content at the second volume.
It can be understood that the second volume may be a volume before the terminal enters the conference room or a volume that is frequently used, after the user leaves the conference room, the conference audio data is played at a volume that is customary to the user, and of course, the second volume may also be a preset volume, for example, a maximum playing volume of the terminal is 100, and the second volume may be preset to be 50.
According to the volume adjusting method provided by the embodiment of the application, whether the terminal leaves the conference room is detected, and when the terminal leaves the conference room is detected, the terminal is automatically adjusted to a proper volume to play the audio data of the conference content, so that the volume adjustment of the terminal is more intelligent, and the problem of poor experience caused by frequent manual volume adjustment is avoided.
In another embodiment, to further enhance the user experience, the volume is raised only when the user actually wants it. As shown in fig. 2b, when it is determined that the terminal has left the conference room, the front camera of the terminal is turned on, and it is determined whether the front camera has captured the face image of the target user for a preset duration and whether the display page of the terminal is the conference playing page (i.e., the designated page). If so, this indicates that the user wants to follow the conference content, and the volume of the terminal is raised to the second volume, i.e., the conference audio data is played at normal volume.
Fig. 3 is a flowchart of a method for detecting whether a terminal leaves a conference room according to an embodiment of the present application. As shown in fig. 3, the method comprises the following steps:
s301, the terminal collects audio data of the current environment.
When the user acquires conference content data through the terminal, a pickup device (such as a microphone) of the terminal is controlled to start to collect audio data of the current environment.
S302, comparing the similarity degree of the audio data collected by the terminal and the audio data in the conference content, and determining the audio matching degree.
For convenience of description, conference audio data is represented as first audio data, and audio data currently collected by the terminal is represented as second audio data. The method for determining the audio matching degree by comparing the similarity degree of the audio data collected by the terminal and the audio data in the conference content is shown in fig. 4.
As shown in fig. 4, first, the first audio data and the second audio data are aligned in S401.
Specifically, a continuous M-frame audio frame sequence starting from a first moment in second audio data is firstly extracted to obtain a third audio frame sequence, a continuous M-frame audio frame sequence starting from a plurality of different second moments in the first audio data is extracted to obtain a plurality of fourth audio frame sequences, the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1. And then finding out a fourth audio frame sequence with highest similarity with the third audio frame sequence, wherein the difference value between the second moment and the first moment corresponding to the fourth audio frame sequence is the time delay compensation parameter. And finally, compensating the time delay of the first audio data based on the time delay compensation parameter, namely realizing the alignment of the first audio data and the second audio data.
The above calculation of the similarity of the third audio frame sequence and the fourth audio frame sequence may be obtained by calculating the similarity of the audio feature of each frame in the third audio frame sequence and the audio feature in the fourth audio frame sequence.
The audio features may be MFCC (Mel Frequency Cepstrum Coefficient, mel-frequency cepstral coefficient) features, FBank (FilterBank) features, etc., and are described herein as being MFCC features, the MFCC feature extraction process is described with reference to fig. 5.
In one example, the audio data is divided into 40 ms frames with a 10 ms frame shift, 12 frames are extracted every 0.5 seconds, and a feature vector of 26 features is extracted per frame. The MFCC feature vector of the i-th frame extracted from the first audio data is M_i = [m_i1, m_i2, ..., m_ip], p = 26 (i.e., the first feature vector), and the MFCC feature vector of the j-th frame extracted from the second audio data is F_j = [f_j1, f_j2, ..., f_jp], p = 26 (i.e., the second feature vector).
The similarity may be calculated by various methods, such as Euclidean distance, cosine similarity, or Manhattan distance. Taking Euclidean distance as an example: the matching start position i in the first audio data is the one at which the MFCC features of the fourth audio frame sequence and the third audio frame sequence satisfy

i = argmin_i Σ_{k=0}^{M-1} d(M_{i+k}, F_{1+k}),

where d(·,·) is the Euclidean distance between two MFCC feature vectors, i.e., the sum of the Euclidean distances over the consecutive M frames is minimal. It is then determined that the first audio data and the second audio data begin to match and overlap at that position, yielding a delay compensation parameter Delay = Δt·(i − 1), where Δt is the frame length (40 ms above). The first audio data is compensated using the delay compensation parameter so as to align it with the second audio data (as shown in fig. 6).
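The delay-compensation search can be sketched as follows, assuming per-frame feature vectors (e.g., MFCCs) have already been extracted. The function name and the 0-based indexing are illustrative; the patent's Delay = Δt·(i − 1) with 1-based i corresponds to index · Δt here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_delay(first_feats, second_feats, m, frame_ms=40):
    """Slide an M-frame window over the first audio's feature frames and
    find the offset whose summed Euclidean distance to the first M frames
    of the second audio is minimal. Returns (best_index, delay_ms)."""
    window = second_feats[:m]  # third audio frame sequence
    best_i, best_cost = 0, float("inf")
    for i in range(len(first_feats) - m + 1):
        cost = sum(euclidean(first_feats[i + k], window[k]) for k in range(m))
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i, best_i * frame_ms  # delay in ms, 0-based index
```

The first audio data would then be shifted by the returned delay before frame-by-frame comparison.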
In steps S402 and S403, consecutive N frames of audio frames of the first audio data for a preset period of time (e.g., 0.5 seconds) are extracted to obtain a first audio frame sequence and N frames of audio frames of the second audio data aligned with the first audio frame sequence are extracted to obtain a second audio frame sequence.
In step S404, the similarity between each frame of the first audio frame sequence and the corresponding audio frame of the second audio frame sequence is calculated. For the calculation method, see the description of the similarity calculation in step S401.
S405, determining the audio matching degree of the first audio data and the second audio data in a preset time period based on the ratio of the number of the up-to-standard audio frames to N in the second audio frame sequence.
An audio frame in the second audio frame sequence qualifies if its similarity to the corresponding audio frame in the first audio frame sequence is greater than the fourth threshold. Every 0.5 seconds constitutes one moment, and one moment contains 12 (i.e., N) frames; the audio matching degree at that moment is computed as P_voicesimilarity = (number of qualifying audio frames) / 12.
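The per-moment matching-degree computation can be sketched as follows (the function name is illustrative; the per-frame similarities are assumed to have been computed as above):

```python
def audio_matching_degree(similarities, fourth_threshold, n=12):
    """similarities: per-frame similarity between each frame of the second
    audio frame sequence and the corresponding frame of the first sequence
    at one moment (N = 12 frames per 0.5 s moment). A frame qualifies when
    its similarity exceeds the fourth threshold; the matching degree is
    qualifying_frames / N."""
    qualifying = sum(1 for s in similarities[:n] if s > fourth_threshold)
    return qualifying / n
```

For example, if 6 of the 12 frames at a moment exceed the fourth threshold, the matching degree for that moment is 0.5.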
Returning to fig. 3, after the execution of step S302 is completed, step S303 is executed. In step S303, it is determined whether the terminal leaves the conference room based at least on the audio matching degree.
The audio matching degree refers to the similarity degree of two pieces of audio data, the higher the similarity degree of the two pieces of audio data is, the higher the matching degree is, and when the two pieces of audio data are completely identical, the audio matching degree is the highest at the moment. If the terminal is located in the conference room, the collected audio data necessarily includes conference audio data, and the conference audio data has a high similarity degree, that is, a high audio matching degree. When the terminal leaves the conference room, the audio data collected at present does not comprise conference audio data, so that the similarity degree of the collected audio data and the conference audio data is low, namely the audio matching degree is low. Therefore, it is possible to determine whether the terminal leaves the conference room using the audio matching degree. For example, when the audio matching degree is less than a first threshold, it is determined that the terminal leaves the conference room.
Compared with approaches that rely only on wireless signal strength, the above method for detecting whether the terminal leaves the conference room judges more accurately whether the user has left. It avoids the problem that wireless signal strength frequently exhibits outliers, which makes detection based on signal strength alone unreliable; it prevents the terminal from being raised to normal playback volume while the user is still in the conference room; and it thereby improves the user experience.
In another embodiment, to further improve the accuracy of the terminal's determination of whether to leave the conference, the terminal determines whether the terminal left the conference room based on the audio match and the received signal strength (Received Signal Strength Indication, RSSI) abrupt change results.
The RSSI abrupt-change result characterizes the probability that the received signal strength has changed abruptly, and is determined based on multiple groups of RSSI values, where the multiple groups of RSSI values are the strengths of signals, sent by communication devices in the conference room, that the terminal receives over a continuous period of time.
For example, because mobile phones are so capable that people rarely part with them, the Bluetooth RSSI values of the mobile phones of multiple participants in the conference room, received by the terminal over a continuous period of time, can serve as the multiple groups of RSSI (as shown in fig. 7). The number X per group may be the larger of the terminal's upper limit on usable Bluetooth devices and the number of participants' phones found by the search, or the number of RSSI values greater than a preset threshold; this ensures that the RSSI has a certain strength and reduces the influence of abnormal fluctuations of weaker RSSI signals.
In one example, a feature vector Y characterizing the RSSI change feature is determined based on the difference between two sets of RSSI at adjacent times, and Y is input into a trained predictive model that outputs the RSSI abrupt change result.
The feature vector Y may be extracted from the difference between two groups of RSSI values at adjacent times. For example, a group of RSSI data is collected every 0.5 seconds; denoting the RSSI of the j-th Bluetooth signal at the i-th time as r_ij, the feature vector at time i is Y_i = [Y_i1, Y_i2, …, Y_ij, …, Y_iX], where Y_ij takes the following value: Y_ij = 1 if |r_ij − r_(i−1)j| > Δj, and Y_ij = 0 otherwise,
where Δj is the maximum allowable fluctuation error (i.e., fifth threshold) of the jth bluetooth RSSI.
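The per-device indicator above can be sketched in Python (a minimal illustration; the function name, the dBm sample values, and the 1/0 encoding of Y_ij are assumptions consistent with the definition of Δj):

```python
def change_feature(r_prev, r_curr, delta):
    """Y_ij = 1 when device j's RSSI jumped by more than its allowed
    fluctuation error (the fifth threshold), else 0."""
    return [1 if abs(c - p) > d else 0
            for p, c, d in zip(r_prev, r_curr, delta)]

# Example: X = 3 Bluetooth devices sampled 0.5 s apart (values in dBm).
r_prev = [-52.0, -60.0, -71.0]   # group at time i-1
r_curr = [-53.0, -75.0, -70.5]   # group at time i: device 2 dropped sharply
delta = [3.0, 3.0, 3.0]          # max allowed fluctuation per device

print(change_feature(r_prev, r_curr, delta))  # [0, 1, 0]
```

A sequence of such vectors over consecutive times forms the input to the prediction model described next.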
It will be appreciated that the predictive model can learn the association between inputs and outputs from training data, so that after training it can generate the corresponding output for a given input. The predictive model may also be called a predictive neural network, a learning model, a learning network, and the like. Various predictive models are available, such as long short-term memory networks (Long Short-Term Memory, LSTM), deep neural networks (Deep Neural Networks, DNN), convolutional neural networks (Convolutional Neural Networks, CNN), and recurrent neural networks (Recurrent Neural Networks, RNN).
Taking LSTM as the prediction network as an example, as shown in FIG. 8, the feature vector Y characterizing the RSSI change is input into the trained LSTM to obtain the RSSI mutation result as output. The trained LSTM thus provides the mathematical mapping between the feature vector Y and the RSSI mutation result.
The obtained RSSI mutation result and the audio matching degree are then weighted and summed to obtain the probability P that the terminal has left the conference room; when P is greater than or equal to a preset probability value, the terminal is determined to have left the conference room, otherwise it has not.
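The patent does not give the weights; one plausible reading of the weighted summation, with illustrative weights and treating a low audio matching degree as evidence of departure, is:

```python
def leave_probability(rssi_mutation, audio_match, w_rssi=0.5, w_audio=0.5):
    """Weighted sum of the two signals; a LOW audio matching degree is
    evidence of having left, so it enters as (1 - audio_match)."""
    return w_rssi * rssi_mutation + w_audio * (1.0 - audio_match)

# High mutation probability plus poor audio match -> likely left the room.
p = leave_probability(rssi_mutation=0.9, audio_match=0.2)
print(p, p >= 0.7)  # compare against a preset probability value of 0.7
```

The weight values, the 0.7 preset probability, and the (1 − match) encoding are all assumptions for illustration.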
In another embodiment, an image recognition result is also obtained, and it is determined whether the terminal leaves the conference room based on the audio matching degree, the received signal mutation result, and the image recognition result.
For example, the terminal is judged to have left the conference room when the audio matching degree is smaller than the first threshold, the RSSI mutation result is larger than the second threshold, and the image recognition result is smaller than the third threshold; or the audio matching degree is smaller than the first threshold and the RSSI mutation result is larger than the second threshold; or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold; or the RSSI mutation result is larger than the second threshold and the image recognition result is smaller than the third threshold; otherwise it has not left. Combining three judgment factors further improves the accuracy of the determination.
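The four branches above reduce to a simple two-of-three vote, which can be sketched as follows (function name and threshold values are illustrative):

```python
def left_room(audio_match, rssi_mutation, image_result, t1, t2, t3):
    """The four branches in the text reduce to: at least two of the
    three factors point to 'left'."""
    votes = [audio_match < t1,      # audio no longer matches room playback
             rssi_mutation > t2,    # received signal strength mutated
             image_result < t3]     # camera no longer sees the user
    return sum(votes) >= 2

# Thresholds t1/t2/t3 are illustrative placeholders.
print(left_room(0.3, 0.8, 0.2, t1=0.5, t2=0.6, t3=0.4))  # True: all agree
```

Any single factor on its own is not enough under this rule, matching the branch list in the text.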
The image recognition result characterizes the probability that the terminal is located in the conference room; it is obtained from an image of the user and an image recognition model, which judges, from recognition of the user object in the image, the probability that the terminal is located in the specific space. The image recognition model can be deployed in the terminal: for example, the terminal establishes a communication connection with a camera in the conference room, acquires the images captured by the camera in real time, and inputs them into the image recognition model to obtain its output. Alternatively, the image recognition model may be deployed in the conference auxiliary device, with the terminal connecting to that device directly to obtain the model's output. Image recognition itself is mature prior art and, for brevity, is not described here.
Of course, it may also be determined whether the terminal leaves the conference room based on only the received signal strength abrupt change result, for example, when the received signal strength abrupt change result is greater than the second threshold value, it is determined that the terminal leaves the conference room.
Or, whether the terminal leaves the conference room is judged based on the image recognition result only, for example, when the image recognition result is smaller than a third threshold value, the terminal is determined to leave the conference room.
Alternatively, whether the terminal has left the conference room is judged based on the received signal strength mutation result and the image recognition result: for example, the two results are weighted and summed to determine the probability P_2 that the terminal has left the conference room; if P_2 is greater than or equal to the preset probability value, the terminal is determined to have left. The mutation result and the image recognition result are obtained as described above and, for brevity, are not detailed here.
In another embodiment, as shown in fig. 9, the volume adjustment method further includes: when the terminal is determined to have left the conference room, collecting the RSSI at the current position as a first signal fingerprint, and then collecting a number of second signal fingerprints at a preset frequency. When a second signal fingerprint matches the first signal fingerprint, the user is about to re-enter the conference room; at that point, so as not to disturb the participants inside, the volume of the terminal is adjusted to a third volume that is larger than the first volume and smaller than the second volume. The third volume is relatively small; for example, when the terminal's upper volume limit is 100, the third volume may be set to about 10, audible to the user but not disturbing others.
It will be appreciated that the signal fingerprint mentioned above is generated from one or more wireless-signal RSSI values in the conference room, such as Bluetooth RSSI or Wi-Fi RSSI.
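The text does not specify how fingerprints are compared; a minimal sketch, assuming a Euclidean-distance comparison over per-device RSSI values and a hypothetical tolerance:

```python
import math

def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two RSSI fingerprints (same device order)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

def fingerprints_match(fp_a, fp_b, tolerance=5.0):
    # The tolerance value is an illustrative assumption.
    return fingerprint_distance(fp_a, fp_b) <= tolerance

first = [-50.0, -62.0, -70.0]   # first fingerprint, captured on leaving
probe = [-51.0, -63.0, -69.0]   # second fingerprint, sampled later
print(fingerprints_match(first, probe))  # True -> user is back near the room
```

A matching probe fingerprint then triggers the switch to the third volume described above.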
In another embodiment, the volume adjustment method further comprises: when the second signal fingerprint is matched with the first signal fingerprint, whether the terminal enters the conference room is determined at least based on one of the audio matching degree, the signal intensity mutation result and the image recognition result, and if yes, the playing volume of the terminal is adjusted to the first volume, so that other conference participants in the conference room are prevented from being influenced.
The audio matching degree, the signal strength mutation result and the image recognition result obtaining manner are described above, and for brevity, description is omitted here.
Based on the audio matching degree, the implementation mode of determining whether the terminal enters the conference room is as follows: and if the audio matching degree is greater than or equal to the first threshold value, determining that the terminal enters the conference room.
Based on the signal strength mutation result, the implementation mode of determining whether the terminal enters the conference room is as follows: and if the signal strength mutation result is larger than the second threshold value, determining that the terminal enters the conference room.
Based on the image recognition result, the implementation mode of determining whether the terminal enters the conference room is as follows: and if the image recognition result is greater than or equal to a third threshold value, determining that the terminal enters the conference room.
Based on the audio matching degree and the signal strength mutation result, whether the terminal has entered the conference room is determined as follows: the audio matching degree and the mutation result are weighted and summed to determine the probability P_j that the terminal has entered the conference room; if P_j is greater than or equal to a first preset probability value, the terminal is determined to have entered. Determining entry from the audio matching degree and the image recognition result, or from the mutation result and the image recognition result, is similar and is not repeated here.
Based on the audio matching degree, the signal strength mutation result and the image recognition result, the implementation mode for determining whether the terminal enters the conference room is as follows: the audio matching degree is larger than or equal to a first threshold value, the received signal strength mutation result is larger than a second threshold value, and the image recognition result is larger than or equal to a third threshold value; or the audio matching degree is larger than or equal to a first threshold value and the received signal strength mutation result is larger than a second threshold value; or the audio matching degree is larger than or equal to a first threshold value and the image recognition result is larger than or equal to a third threshold value; or the received signal strength mutation result is larger than the second threshold value and the image recognition result is larger than or equal to the third threshold value; it is determined that the terminal enters the conference room.
It is to be understood that when conference content is played in the conference room using a conference device (e.g., a display with a communication function), the acquired conference content also includes the display content of the conference device, such as text, image, and video information. The conference content may also include image or video information captured by a camera in the conference room; for example, to aid understanding, information written on the blackboard by participants is captured by the camera and played synchronously on the terminal.
In another embodiment, the volume adjustment method further includes: adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold based on the audio matching degree, the received signal mutation result, and the image recognition result, thereby adjusting and optimizing the decision parameters of the audio matching degree and of the received signal mutation result in real time, so that the method becomes more accurate.
Specifically, when the image recognition result is smaller than the third threshold value and the received signal strength mutation result is larger than the second threshold value, the audio matching degree is larger than or equal to the first threshold value; or, the image recognition result is larger than or equal to a third threshold value, the received signal strength mutation result is smaller than or equal to a second threshold value, and the audio matching degree is smaller than the first threshold value; the delay compensation parameter and a fourth threshold are adjusted.
Referring to fig. 10, if the audio matching degree is smaller than the first threshold, it is checked whether the audio similarity change rate K_ij at that time is greater than a preset threshold, i.e., whether K_ij changes significantly. If so, the first audio data and the second audio data are misaligned and the previously calculated delay compensation parameter is wrong, so the delay compensation parameter is adjusted until K_ij is less than or equal to the preset threshold, i.e., until K_ij no longer changes significantly. If not, the first audio data and the second audio data are aligned but the similarity requirement is too strict, i.e., the fourth threshold is too large, so that too few audio frames reach the standard; the fourth threshold is therefore reduced until the audio matching degree is greater than or equal to the first threshold.
Here K_ij characterizes the audio similarity change rate. Taking Euclidean distance as the similarity measure, K_ij is calculated as follows: with D_i the Euclidean distance at the current moment (i.e., the second moment), D_(i−1) the Euclidean distance at the previous moment (the third moment), and ΔT the time difference between the two moments, K_ij = (D_i − D_(i−1))/ΔT.
If the audio matching degree is greater than or equal to the first threshold, the fourth threshold is too small: the similarity requirement is too low and too many audio frames reach the standard, so the fourth threshold is increased until the audio matching degree is smaller than the first threshold. The decision parameters of the audio matching degree (the delay compensation parameter and the fourth threshold) are thus dynamically adjusted in real time, so that whether the terminal is located in the conference room is judged more accurately.
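The adjustment flow around Fig. 10 can be sketched as a single decision step (the function name, the fixed step size, and the returned action strings are assumptions, not from the patent):

```python
def adjust_audio_params(audio_match, first_threshold, k_ij, k_preset,
                        fourth_threshold, step=0.05):
    """One adjustment step, invoked when image/RSSI contradict the audio
    decision and the audio parameters are therefore suspect."""
    if audio_match < first_threshold:
        if abs(k_ij) > k_preset:
            # Similarity changing sharply: streams misaligned, so the
            # delay compensation parameter must be re-estimated.
            return "adjust delay compensation", fourth_threshold
        # Aligned but the similarity requirement is too strict.
        return "lower fourth threshold", fourth_threshold - step
    # Too many frames reach the standard: requirement too loose.
    return "raise fourth threshold", fourth_threshold + step

action, t = adjust_audio_params(0.3, 0.5, k_ij=2.0, k_preset=1.0,
                                fourth_threshold=0.8)
print(action)  # adjust delay compensation
```

In practice the step would repeat until the stopping condition in the text is met; a single step is shown for brevity.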
When the image recognition result is smaller than the third threshold value and the audio matching degree is smaller than the first threshold value, the received signal strength mutation result is smaller than or equal to the second threshold value; or when the image recognition result is greater than or equal to the third threshold value and the audio matching degree is greater than or equal to the first threshold value, and the received signal strength mutation result is greater than the second threshold value; the fifth threshold is adjusted.
Referring to fig. 11, if the RSSI mutation result is greater than the second threshold, the fifth threshold is too small and is increased until the mutation result is less than or equal to the second threshold; otherwise, the fifth threshold is too large and is reduced until the mutation result is greater than the second threshold.
Mutual learning between the decision parameters of the audio matching degree and those of the received signal strength is thus realized with dynamic adjustment, so that whether the terminal has left or entered the conference room is judged more accurately; the volume adjustment is in turn performed more accurately, improving user experience.
The volume adjustment method is not limited to conference-room scenarios; it can be applied in any place that requires terminals to be muted, such as movie theaters and monitoring rooms. When a user leaves such a place but does not want to miss the information inside, the content can be synchronized to the terminal for playback: the terminal plays muted while still inside the place, and its volume is adjusted to the normal level once it has left. The movie-theater scenario is described below as an example.
As shown in fig. 12, the large-screen projection device 12 in a movie theater is playing a film. When a user needs to leave the auditorium (for example, to go to the bathroom) but does not want to miss a highlight, the user can scan the two-dimensional code on the seat with a portable terminal (for example, a mobile phone) to request that the content projected on the large-screen device be synchronized to the phone for playback. Meanwhile, the terminal itself judges through a built-in algorithm whether it has left the auditorium: so as not to disturb other viewers, it plays muted before leaving, and automatically switches to normal volume after judging that it has left. When the terminal is about to re-enter the auditorium, it automatically lowers the volume, and after entering it automatically switches to muted playback or stops playback. How the terminal determines that it has left, entered, or is about to enter the auditorium is described above and is not repeated here.
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 13, the electronic device 130 includes at least:
an obtaining module 131, configured to obtain, when the terminal is located in a specific space, multimedia content played/collected by a device in the specific space;
A playing module 132, configured to play the multimedia content and play audio data of the multimedia content at a first volume;
a detection module 133 for detecting whether the terminal leaves the specific space;
and the adjusting module 135 is configured to, when detecting that the terminal leaves the specific space, continue playing the multimedia content by the terminal, and automatically adjust to play the audio data of the multimedia content at a second volume, where the second volume is greater than the first volume.
In one possible implementation, the obtaining module 131 is further configured to: acquiring second audio data, wherein the second audio data is the audio data currently acquired by the terminal;
the detection module 133 is further configured to: comparing the similarity degree of the first audio data and the second audio data, and determining the audio matching degree, wherein the first audio data is the audio data of the multimedia content; based at least on the audio match, it is determined whether the terminal leaves the particular space.
In one possible implementation, the determining whether the terminal leaves the specific space based at least on the audio matching degree includes: and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based at least on the audio matching degree includes: determining whether the terminal leaves a specific space based on at least the audio matching degree and the received signal strength mutation result; the received signal strength mutation result characterizes the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths sent by the communication devices in the specific space received by the terminal in continuous time.
In another possible implementation, the determining whether the terminal leaves a specific space based at least on the audio matching degree and the received signal strength mutation result includes: carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space; and if the P is larger than or equal to a preset probability value, determining that the terminal leaves the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based at least on the audio matching degree includes: determining whether the terminal leaves a specific space or not based on the audio matching degree, the received signal mutation result and the image recognition result; the received signal strength mutation result represents the probability of mutation of the received signal strength, and is determined based on a plurality of groups of received signal strengths, wherein the plurality of groups of received signal strengths are the signal strengths sent by the communication device in the specific space received by the terminal in continuous time; the image recognition result characterizes the probability that the terminal is located in the specific space, the probability is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
In another possible implementation, the determining whether the terminal leaves a specific space based on the audio matching degree, the received signal mutation result and the image recognition result includes: the audio matching degree is smaller than a first threshold value, the received signal strength mutation result is larger than a second threshold value, and the image recognition result is smaller than a third threshold value; or the audio matching degree is smaller than the first threshold value and the received signal strength mutation result is larger than a second threshold value; or the audio matching degree is smaller than a first threshold value and the image recognition result is smaller than a third threshold value; or the received signal strength mutation result is larger than a second threshold value and the image recognition result is smaller than a third threshold value; the terminal is judged to leave the specific space.
In another possible implementation, the detection module 133 is further configured to align the first audio data and the second audio data; extracting continuous N frames of audio frames of the first audio data in a preset time period to obtain a first audio frame sequence; extracting N frames of audio frames aligned with the first audio frame sequence in the second audio data to obtain a second audio frame sequence; the N is a positive integer greater than or equal to 1; calculating the similarity between each frame of audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence; and determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of the up-to-standard audio frames to N in the second audio frame sequence, wherein the up-to-standard audio frames are audio frames with the similarity between the audio frames in the second audio frame sequence and the corresponding audio frames in the first audio frame sequence being larger than a fourth threshold value.
In another possible implementation, the calculating the similarity between each frame of the first audio frame sequence and the corresponding audio frame of the second audio frame sequence includes: respectively extracting and obtaining a first characteristic vector representing each frame of audio frames in the first audio frame sequence and a second characteristic vector representing each frame of audio frames in the second audio frame sequence; and determining the similarity of each frame of audio frames in the first audio frame sequence and the corresponding audio frames in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
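The two implementations above can be sketched together, assuming cosine similarity over per-frame feature vectors (all names and the toy vectors are illustrative; the patent leaves the similarity measure open):

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def audio_matching_degree(frames1, frames2, fourth_threshold):
    """Ratio of 'up-to-standard' frames: aligned frame pairs whose
    feature-vector similarity exceeds the fourth threshold, over N."""
    ok = sum(1 for f1, f2 in zip(frames1, frames2)
             if cosine_similarity(f1, f2) > fourth_threshold)
    return ok / len(frames1)

# Toy 2-dimensional "feature vectors" for N = 3 aligned frame pairs.
frames1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # from the first audio data
frames2 = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]  # from the second audio data
print(audio_matching_degree(frames1, frames2, 0.8))  # 2 of 3 frames pass
```

Real feature vectors would come from an audio front end (e.g., spectral features); the toy vectors only illustrate the counting rule.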
In another possible implementation, the aligning the first audio data and the second audio data includes: extracting a continuous M-frame audio frame sequence from a first moment in the second audio data to obtain a third audio frame sequence, and extracting a continuous M-frame audio frame sequence from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1; determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence; the first audio data and the second audio data are aligned based on the delay compensation parameter.
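The alignment above can be sketched as a best-offset search (scalar frame features and an absolute-difference score are simplifying assumptions; the patent leaves the similarity measure open):

```python
def estimate_delay(second_audio, first_audio, m):
    """Slide an M-frame window of the recorded (second) audio over the
    source (first) audio; the best-matching offset is the delay
    compensation parameter."""
    window = second_audio[:m]
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(first_audio) - m + 1):
        candidate = first_audio[offset:offset + m]
        # Negative total absolute difference: higher means more similar.
        score = -sum(abs(a - b) for a, b in zip(window, candidate))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

# The recorded snippet [3, 5, 2] appears 2 frames into the source audio.
print(estimate_delay([3.0, 5.0, 2.0], [0.0, 0.0, 3.0, 5.0, 2.0, 0.0], 3))  # 2
```

The returned offset is then used to shift one stream so that frame i of the first audio data lines up with frame i of the second.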
In another possible implementation, the detecting module 133 is further configured to obtain multiple sets of received signal strengths in a continuous time; and determining the received signal strength mutation result based on the multiple groups of received signal strengths.
In another possible implementation, the determining the received signal strength mutation result based on the multiple sets of received signal strengths includes: determining a feature vector representing the change feature of the received signal strength based on the difference value of the received signal strengths at adjacent moments; and inputting the feature vector into a preset prediction model, and determining the received signal strength mutation result.
In another possible implementation, the plurality of sets of received signal strengths are a plurality of sets of X bluetooth received signal strengths, where the plurality of sets of X bluetooth received signal strengths are bluetooth signal strengths sent by X bluetooth devices in the specific space received by the terminal in a continuous time, and X is a positive integer greater than or equal to 3; and the received signal intensity of one group of the plurality of groups of received signal intensities is the Bluetooth signal intensity transmitted by X Bluetooth devices in the specific space received by the terminal at the same time.
In another possible implementation, the adjustment module 135 is further configured to: if the terminal is detected to leave the specific space, controlling a front camera of the terminal to be opened; judging whether the condition that whether the face image of the target user collected by the front camera lasts for a preset time and whether the display page of the terminal is a designated display page or not is met; if yes, automatically adjusting the terminal to the second volume to play the audio data of the multimedia content.
In another possible implementation, the electronic device 130 further includes an acquisition module 134, configured to: if the terminal leaves the specific space, collect a first signal fingerprint, the first signal fingerprint being determined based on the received signal strengths collected by the terminal at its current position, and collect a plurality of second signal fingerprints at a preset frequency. The adjustment module 135 is further configured to, when a second signal fingerprint matches the first signal fingerprint, adjust the volume of the terminal to a third volume, the third volume being larger than the first volume and smaller than the second volume.
In another possible implementation, the detection module 133 is further configured to: if the terminal is detected to leave the specific space, continuing to detect whether the terminal enters the specific space or not; and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
In another possible implementation, the multimedia content further includes display content of devices in the particular space.
In another possible implementation, the electronic device 130 further includes a parameter optimization module 136 configured to adjust the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result.
In another possible implementation, adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result includes: when the image recognition result is smaller than a third threshold value and the received signal strength mutation result is larger than a second threshold value, the audio matching degree is larger than or equal to the first threshold value; or, the image recognition result is larger than or equal to a third threshold value, the received signal strength mutation result is smaller than or equal to a second threshold value, and the audio matching degree is smaller than the first threshold value; adjusting the delay compensation parameter and a fourth threshold;
when the image recognition result is smaller than the third threshold value and the audio matching degree is smaller than the first threshold value, the received signal strength mutation result is smaller than or equal to the second threshold value; or when the image recognition result is greater than or equal to the third threshold value and the audio matching degree is greater than or equal to the first threshold value, and the received signal strength mutation result is greater than the second threshold value; the fifth threshold is adjusted. The mutual learning of the decision parameters of the audio matching degree and the decision parameters of the received signal strength mutation result is realized, the dynamic adjustment is carried out, and the judgment accuracy of whether the terminal leaves a specific space is improved.
In another possible implementation, adjusting the delay compensation parameter and the fourth threshold when the image recognition result is smaller than the third threshold and the received signal strength mutation result is greater than the second threshold while the audio matching degree is greater than or equal to the first threshold, or when the image recognition result is greater than or equal to the third threshold and the received signal strength mutation result is smaller than or equal to the second threshold while the audio matching degree is smaller than the first threshold, includes: when the image recognition result is smaller than the third threshold and the received signal strength mutation result is greater than the second threshold while the audio matching degree is greater than or equal to the first threshold, increasing the fourth threshold until the audio matching degree is smaller than the first threshold;
when the image recognition result is greater than or equal to the third threshold and the received signal strength mutation result is smaller than or equal to the second threshold while the audio matching degree is smaller than the first threshold, judging whether the audio similarity change rate is greater than a preset threshold; if so, adjusting the delay compensation parameter until the audio similarity change rate is smaller than the preset threshold; if not, reducing the fourth threshold until the audio matching degree is greater than or equal to the first threshold;
the audio similarity change rate is determined as the ratio of (a) the difference between the similarity of the audio frames of the first audio data and the second audio data at a second moment and their similarity at a third moment, to (b) the difference between the second moment and the third moment, where the second moment is a moment at which the image recognition result is greater than or equal to the third threshold, the received signal strength mutation result is smaller than or equal to the second threshold, and the audio matching degree is smaller than the first threshold, and the third moment is a moment adjacent to the second moment.
In another possible implementation, adjusting the fifth threshold when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold while the received signal strength mutation result is smaller than or equal to the second threshold, or when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold while the received signal strength mutation result is greater than the second threshold, includes: when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold while the received signal strength mutation result is smaller than or equal to the second threshold, decreasing the fifth threshold until the received signal strength mutation result is greater than the second threshold; when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold while the received signal strength mutation result is greater than the second threshold, increasing the fifth threshold until the received signal strength mutation result is smaller than or equal to the second threshold.
The electronic device 130 according to the embodiments of the present application may perform the methods described in the embodiments of the present application; the above and other operations and/or functions of each module in the electronic device 130 implement the corresponding flows of the methods in figs. 2-11 and are not repeated here for brevity.
It should be further noted that the embodiments described above are merely illustrative. Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the device embodiment drawings provided by this application, a connection between modules indicates a communication connection between them, which may be implemented as one or more communication buses or signal lines.
The application also provides a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform any of the methods described above.
The application also provides a computer program or computer program product comprising instructions which, when executed, cause a computer to perform any of the methods described above.
The application also provides a terminal comprising a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement any one of the methods described above.
Fig. 14 is a schematic structural diagram of a terminal provided by the present application.
As shown in fig. 14, the terminal 140 includes a processor 141, a memory 142, a bus 143, a microphone 144, a speaker 145, a display 146, and a communication interface 147. The processor 141, the memory 142, the microphone 144, the speaker 145, the display 146, and the communication interface 147 communicate via the bus 143, or by other means such as wireless transmission. The microphone 144 may collect audio data, such as the first audio data; the speaker 145 may play audio data, such as the second audio data; the display 146 may display multimedia content, such as conference image, text, or video content shown by a conference large-screen device in a conference room; the communication interface 147 is used to establish communication connections with other communication devices, such as a conference large-screen device or a projection large-screen device in a projection hall; the memory 142 stores executable program code, and the processor 141 may call the program code stored in the memory 142 to perform the volume adjustment method of the foregoing method embodiments.
It should be appreciated that in embodiments of the present application, the processor 141 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 142 may include read only memory and random access memory and provides instructions and data to the processor 141. Memory 142 may also include non-volatile random access memory. For example, the memory 142 may also store training data sets.
The memory 142 may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The bus 143 may include a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus 143 in the figures.
It should be understood that the terminal 140 according to the embodiments of the present application may correspond to the electronic device described in the embodiments of the present application and may perform the methods shown in figs. 2 to 11; the foregoing and other operations and/or functions of the respective components of the terminal 140 implement the corresponding flows of those methods and are not repeated here for brevity.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the elements and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the application and is not intended to limit its scope to the particular embodiments; any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the application shall be included within the scope of the application.

Claims (27)

1. A volume adjustment method, applied to a terminal, comprising:
when the terminal is positioned in a specific space, the terminal acquires multimedia content played/acquired by other devices positioned in the specific space;
the terminal plays the multimedia content and plays the audio data of the multimedia content at a first volume;
detecting whether the terminal leaves the specific space;
when the terminal is detected to leave the specific space, the terminal continues to play the multimedia content and automatically adjusts to a second volume to play the audio data of the multimedia content, wherein the second volume is larger than the first volume;
the detecting whether the terminal leaves the specific space includes:
acquiring second audio data, wherein the second audio data is the audio data currently acquired by the terminal;
comparing the similarity degree of the first audio data and the second audio data, and determining the audio matching degree, wherein the first audio data is the audio data of the multimedia content;
determining, based at least on the audio matching degree, whether the terminal leaves the specific space.
2. The method of claim 1, wherein the determining whether the terminal leaves the particular space based at least on the audio matching degree comprises:
and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
3. The method of claim 1, wherein the determining whether the terminal leaves the particular space based at least on the audio matching degree comprises:
Determining whether the terminal leaves the specific space based on at least the audio matching degree and the received signal strength mutation result;
the received signal strength mutation result characterizes the probability of a mutation in the received signal strength and is determined based on multiple sets of received signal strengths, wherein the multiple sets of received signal strengths are the signal strengths transmitted by communication devices in the specific space and received by the terminal over continuous time.
4. The method of claim 3, wherein the determining whether the terminal leaves the particular space based at least on the audio matching degree and a received signal strength mutation result comprises:
carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space;
and if the P is larger than or equal to a preset probability value, determining that the terminal leaves the specific space.
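The weighted fusion of claim 4 can be sketched in a few lines. The weights, the preset probability value, and the convention that a low audio matching degree contributes as (1 - match) are illustrative assumptions; the patent does not fix them.

```python
def leave_probability(rssi_mutation: float, audio_match: float,
                      w_rssi: float = 0.6, w_audio: float = 0.4) -> float:
    # Weighted summation of the RSSI-mutation result and the audio
    # matching degree; a low match suggests leaving, so it enters
    # inverted. Weights are placeholders, not values from the patent.
    return w_rssi * rssi_mutation + w_audio * (1.0 - audio_match)

def terminal_left(rssi_mutation, audio_match, preset_probability=0.5):
    return leave_probability(rssi_mutation, audio_match) >= preset_probability

print(terminal_left(0.9, 0.2))  # strong mutation, weak match -> True
print(terminal_left(0.1, 0.9))  # stable signal, strong match -> False
```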
5. The method of claim 1, wherein the determining whether the terminal leaves the particular space based at least on the audio matching degree comprises:
determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result, and the image recognition result;
the received signal strength mutation result characterizes the probability of a mutation in the received signal strength and is determined based on multiple sets of received signal strengths, wherein the multiple sets of received signal strengths are the signal strengths transmitted by communication devices in the specific space and received by the terminal over continuous time;
the image recognition result characterizes the probability that the terminal is located in the specific space, the probability is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
6. The method of claim 5, wherein the determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result, and the image recognition result comprises:
judging that the terminal leaves the specific space when any one of the following holds:
the audio matching degree is smaller than a first threshold, the received signal strength mutation result is greater than a second threshold, and the image recognition result is smaller than a third threshold;
or the audio matching degree is smaller than the first threshold and the received signal strength mutation result is greater than the second threshold;
or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold;
or the received signal strength mutation result is greater than the second threshold and the image recognition result is smaller than the third threshold.
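Claim 6 enumerates the case where all three indicators agree plus every pairwise combination, which is equivalent to a two-of-three vote. A minimal sketch with placeholder threshold values:

```python
def judge_left(audio_match, rssi_mutation, image_result,
               first_th=0.5, second_th=0.5, third_th=0.5):
    # Each indicator casts a vote for "the terminal has left";
    # any two (or all three) votes are enough, per claim 6.
    votes = (audio_match < first_th,
             rssi_mutation > second_th,
             image_result < third_th)
    return sum(votes) >= 2

print(judge_left(0.2, 0.8, 0.9))  # audio and RSSI agree -> True
print(judge_left(0.9, 0.2, 0.9))  # no indicator fires -> False
```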
7. The method of claim 1, wherein comparing the similarity of the first audio data and the second audio data to determine the audio matching degree comprises:
aligning the first audio data and the second audio data;
extracting continuous N frames of audio frames of the first audio data in a preset time period to obtain a first audio frame sequence;
extracting N frames of audio frames aligned with the first audio frame sequence in the second audio data to obtain a second audio frame sequence; the N is a positive integer greater than or equal to 1;
calculating the similarity between each frame of audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence;
and determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of the up-to-standard audio frames to N in the second audio frame sequence, wherein the up-to-standard audio frames are audio frames with the similarity between the audio frames in the second audio frame sequence and the corresponding audio frames in the first audio frame sequence being larger than a fourth threshold value.
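The matching degree of claim 7 is simply the fraction of aligned frames whose similarity clears the fourth threshold. A minimal sketch; the scalar "frames" and the toy similarity function are illustrative stand-ins for real spectral features:

```python
def audio_match_degree(first_frames, second_frames, frame_similarity,
                       fourth_threshold=0.8):
    # Fraction of frames in the second sequence whose similarity to the
    # aligned frame in the first sequence exceeds the fourth threshold.
    assert first_frames and len(first_frames) == len(second_frames)
    qualified = sum(1 for a, b in zip(first_frames, second_frames)
                    if frame_similarity(a, b) > fourth_threshold)
    return qualified / len(first_frames)

# Toy frames as scalars; a real system would compare spectral features.
sim = lambda a, b: 1.0 - abs(a - b)
degree = audio_match_degree([0.5, 0.5, 0.5], [0.5, 0.6, 0.9], sim)
print(degree)  # 2 of 3 frames exceed the threshold -> about 0.667
```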
8. The method of claim 7, wherein said calculating the similarity of each audio frame in the first sequence of audio frames to a corresponding audio frame in the second sequence of audio frames comprises:
respectively extracting and obtaining a first characteristic vector representing each frame of audio frames in the first audio frame sequence and a second characteristic vector representing each frame of audio frames in the second audio frame sequence;
and determining the similarity of each frame of audio frames in the first audio frame sequence and the corresponding audio frames in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
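The patent does not name the vector similarity measure; cosine similarity is a common choice for per-frame audio feature vectors and serves as an assumed example here:

```python
import math

def cosine_similarity(v1, v2):
    # Cosine of the angle between two feature vectors; returns 0.0 for a
    # zero-length vector to avoid division by zero.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 > 0 and norm2 > 0 else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```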
9. The method of claim 7, wherein said aligning said first audio data and second audio data comprises:
extracting a continuous M-frame audio frame sequence from a first moment in the second audio data to obtain a third audio frame sequence, and extracting a continuous M-frame audio frame sequence from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1;
determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence;
The first audio data and the second audio data are aligned based on the delay compensation parameter.
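The alignment of claim 9 amounts to finding the reference offset whose M-frame window best matches the start of the recorded audio. The sketch below uses mean absolute difference over scalar samples as an assumed similarity stand-in; a real implementation would compare frame-level features:

```python
def estimate_delay(reference, recorded, m):
    # Slide an M-sample window of the recorded audio over every candidate
    # offset in the reference and keep the offset with the smallest mean
    # absolute difference -- a minimal stand-in for the similarity-based
    # delay compensation parameter of claim 9.
    window = recorded[:m]
    best_offset, best_score = 0, float("inf")
    for offset in range(len(reference) - m + 1):
        segment = reference[offset:offset + m]
        score = sum(abs(a - b) for a, b in zip(segment, window)) / m
        if score < best_score:
            best_offset, best_score = offset, score
    return best_offset

reference = [0, 0, 0, 1, 2, 3, 0, 0]
recorded = [1, 2, 3]                 # same content, delayed by 3 samples
print(estimate_delay(reference, recorded, m=3))  # 3
```

The returned offset plays the role of the delay compensation parameter: shifting one stream by it aligns the two sequences before frame-by-frame comparison.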
10. The method of claim 3, further comprising:
acquiring multiple sets of received signal strengths over continuous time;
and determining the received signal strength mutation result based on the multiple groups of received signal strengths.
11. The method of claim 10, wherein said determining said received signal strength mutation result based on said plurality of sets of received signal strengths comprises:
determining a feature vector representing the change feature of the received signal strength based on the difference value of the received signal strengths at adjacent moments;
and inputting the feature vector into a preset prediction model, and determining the received signal strength mutation result.
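Claims 10-11 build a feature vector from adjacent-moment RSSI differences and feed it to a preset prediction model. The feature extraction below follows the claim; the model itself is mocked by a simple magnitude heuristic, which is purely an illustrative assumption:

```python
def rssi_change_features(samples):
    # Differences of received signal strength at adjacent moments (claim 11).
    return [b - a for a, b in zip(samples, samples[1:])]

def mutation_score(features, scale=10.0):
    # Placeholder for the preset prediction model: a trained classifier
    # would consume the feature vector; here a larger average drop simply
    # maps to a higher mutation probability, clamped to [0, 1].
    avg_drop = max(0.0, -sum(features) / len(features))
    return min(1.0, avg_drop / scale)

leaving = [-45, -48, -62, -78]   # dBm readings while walking out of a room
staying = [-45, -46, -45, -46]
print(mutation_score(rssi_change_features(leaving)))   # close to 1.0
print(mutation_score(rssi_change_features(staying)))   # close to 0.0
```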
12. The method of claim 10, wherein the multiple sets of received signal strengths are multiple sets of X Bluetooth received signal strengths, namely the Bluetooth signal strengths transmitted by X Bluetooth devices in the specific space and received by the terminal over continuous time, X being a positive integer greater than or equal to 3;
and one set of the multiple sets of received signal strengths comprises the Bluetooth signal strengths transmitted by the X Bluetooth devices in the specific space and received by the terminal at the same moment.
13. The method of claim 1, wherein the terminal continues to play the multimedia content and automatically adjusts to play the audio data of the multimedia content at the second volume when the terminal is detected to leave the specific space, comprising:
if the terminal is detected to leave the specific space, controlling a front camera of the terminal to be opened;
judging whether the following conditions are met: a face image of the target user collected by the front camera persists for a preset duration, and the display page of the terminal is a designated display page;
if yes, automatically adjusting the terminal to the second volume to play the audio data of the multimedia content.
14. The method as recited in claim 1, further comprising:
if the terminal is detected to leave the specific space, collecting a first signal fingerprint, wherein the first signal fingerprint is determined based on the received signal strengths collected by the terminal at its current position;
collecting a plurality of second signal fingerprints at a preset frequency; and
when a second signal fingerprint matches the first signal fingerprint, automatically adjusting the volume of the terminal to a third volume to play the audio data of the multimedia content, wherein the third volume is larger than the first volume and smaller than the second volume.
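The fingerprint comparison in claim 14 can be sketched as follows. Modeling a fingerprint as a per-beacon RSSI map and matching within a dBm tolerance are illustrative assumptions; the claim only requires that the second fingerprint match the first:

```python
def fingerprints_match(fp_a, fp_b, tolerance_db=5.0):
    # Signal fingerprints modeled as {beacon_id: RSSI in dBm}; two
    # fingerprints match when every shared beacon reading differs by no
    # more than the tolerance.
    shared = fp_a.keys() & fp_b.keys()
    if not shared:
        return False
    return all(abs(fp_a[k] - fp_b[k]) <= tolerance_db for k in shared)

saved = {"ap1": -40.0, "ap2": -60.0}   # first fingerprint, taken on leaving
print(fingerprints_match(saved, {"ap1": -42.0, "ap2": -58.0}))  # True
print(fingerprints_match(saved, {"ap1": -70.0, "ap2": -58.0}))  # False
```

When a later fingerprint matches the saved one, the terminal infers it is back near the position where it left, and moves to the intermediate third volume.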
15. The method as recited in claim 1, further comprising:
if the terminal is detected to leave the specific space, continuing to detect whether the terminal enters the specific space or not;
and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
16. The method of any of claims 1-15, wherein the multimedia content further comprises display content of devices in the particular space.
17. A terminal, comprising:
the acquisition module is used for acquiring, when the terminal is located in a specific space, multimedia content played/acquired by other devices located in the specific space;
the playing module is used for playing the multimedia content and playing the audio data of the multimedia content at a first volume;
the detection module is used for detecting whether the terminal leaves the specific space;
the adjusting module is used for continuously playing the multimedia content when the terminal is detected to leave the specific space, and automatically adjusting the terminal to play the audio data of the multimedia content at a second volume, wherein the second volume is larger than the first volume;
The acquisition module is further configured to: acquiring second audio data, wherein the second audio data is the audio data currently acquired by the terminal;
the detection module is also used for: comparing the similarity degree of the first audio data and the second audio data, and determining the audio matching degree, wherein the first audio data is the audio data of the multimedia content;
determining, based at least on the audio matching degree, whether the terminal leaves the specific space.
18. The terminal of claim 17, wherein the determining whether the terminal leaves the particular space based at least on the degree of audio matching comprises:
and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
19. The terminal of claim 17, wherein the determining whether the terminal leaves the particular space based at least on the degree of audio matching comprises:
determining whether the terminal leaves the specific space based on at least the audio matching degree and the received signal strength mutation result;
the received signal strength mutation result characterizes the probability of a mutation in the received signal strength and is determined based on multiple sets of received signal strengths, wherein the multiple sets of received signal strengths are the signal strengths transmitted by communication devices in the specific space and received by the terminal over continuous time.
20. The terminal of claim 19, wherein the determining whether the terminal leaves the particular space based at least on the audio matching degree and a received signal strength mutation result comprises:
carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space;
and if the P is larger than or equal to a preset probability value, determining that the terminal leaves the specific space.
21. The terminal of claim 17, wherein the determining whether the terminal leaves the particular space based at least on the degree of audio matching comprises:
determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result, and the image recognition result;
the received signal strength mutation result characterizes the probability of a mutation in the received signal strength and is determined based on multiple sets of received signal strengths, wherein the multiple sets of received signal strengths are the signal strengths transmitted by communication devices in the specific space and received by the terminal over continuous time;
the image recognition result characterizes the probability that the terminal is located in the specific space, the probability is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
22. The terminal of claim 21, wherein the determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result, and the image recognition result comprises:
judging that the terminal leaves the specific space when any one of the following holds:
the audio matching degree is smaller than a first threshold, the received signal strength mutation result is greater than a second threshold, and the image recognition result is smaller than a third threshold;
or the audio matching degree is smaller than the first threshold and the received signal strength mutation result is greater than the second threshold;
or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold;
or the received signal strength mutation result is greater than the second threshold and the image recognition result is smaller than the third threshold.
23. The terminal of claim 17, wherein the adjustment module is further configured to:
if the terminal is detected to leave the specific space, controlling a front camera of the terminal to be opened;
judging whether the following conditions are met: a face image of the target user collected by the front camera persists for a preset duration, and the display page of the terminal is a designated display page;
If yes, automatically adjusting the terminal to the second volume to play the audio data of the multimedia content.
24. The terminal of claim 17, wherein the terminal further comprises:
the acquisition module is used for acquiring a first signal fingerprint if the terminal is detected to leave the specific space, wherein the first signal fingerprint is determined based on the intensity of a received signal acquired by the terminal at the current position;
collecting a plurality of second signal fingerprints at a preset frequency;
the adjusting module is further configured to automatically adjust a volume of the terminal to a third volume to play audio data of the multimedia content when the second signal fingerprint matches the first signal fingerprint, where the third volume is greater than the first volume and less than the second volume.
25. The terminal according to any of claims 17-24, wherein the detection module is further configured to: if the terminal is detected to leave the specific space, continue to detect whether the terminal enters the specific space;
and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
26. A terminal comprising a memory and a processor, wherein the memory has executable code stored therein, and the processor executes the executable code to implement the method of any of claims 1-16.
27. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-16.
CN202011638242.5A 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium Active CN114697445B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011638242.5A CN114697445B (en) 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium
PCT/CN2021/136096 WO2022143040A1 (en) 2020-12-31 2021-12-07 Volume adjusting method, electronic device, terminal, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011638242.5A CN114697445B (en) 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium

Publications (2)

Publication Number Publication Date
CN114697445A CN114697445A (en) 2022-07-01
CN114697445B true CN114697445B (en) 2023-09-01

Family

ID=82135153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011638242.5A Active CN114697445B (en) 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium

Country Status (2)

Country Link
CN (1) CN114697445B (en)
WO (1) WO2022143040A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208700B (en) * 2023-04-25 2023-07-21 深圳市华卓智能科技有限公司 Control method and system for communication between mobile phone and audio equipment

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750258A (en) * 2011-03-30 2012-10-24 微软公司 Mobile device configuration based on status and location
CN103761063A (en) * 2013-12-19 2014-04-30 北京百度网讯科技有限公司 Method and device for controlling audio output volume in playing device
CN103888579A (en) * 2012-12-21 2014-06-25 中国移动通信集团广西有限公司 Method and device for adjusting beep volume of mobile terminal, and mobile terminal
CN104363563A (en) * 2014-11-24 2015-02-18 广东欧珀移动通信有限公司 Network-based mobile terminal voice control method and system
CN104616675A (en) * 2013-11-05 2015-05-13 华为终端有限公司 Method for switching playing equipment and mobile terminal
CN105592195A (en) * 2016-01-20 2016-05-18 努比亚技术有限公司 Volume adaptive adjusting method and apparatus
CN105607735A (en) * 2015-12-17 2016-05-25 深圳Tcl数字技术有限公司 Output controlling system and method of multimedia equipment
CN105827797A (en) * 2015-07-29 2016-08-03 维沃移动通信有限公司 Method for adjusting volume of electronic device and electronic device
CN106453860A (en) * 2016-09-26 2017-02-22 广东小天才科技有限公司 Switching method and device of sound mode and user terminal
CN107135308A (en) * 2017-04-26 2017-09-05 努比亚技术有限公司 Multimedia file plays audio control method, mobile terminal and readable storage medium storing program for executing
CN107172295A (en) * 2017-06-21 2017-09-15 上海斐讯数据通信技术有限公司 One kind control mobile terminal mute method and mobile terminal
CN107431860A (en) * 2015-03-12 2017-12-01 四达时代通讯网络技术有限公司 Audio system based on location-based service
CN107566888A (en) * 2017-09-12 2018-01-09 中广热点云科技有限公司 The audio setting method of multiple multimedia play equipments, multimedia play system
CN108156328A (en) * 2018-01-24 2018-06-12 维沃移动通信有限公司 The switching method and device of contextual model
CN108647005A (en) * 2018-05-15 2018-10-12 努比亚技术有限公司 Audio frequency playing method, mobile terminal and computer readable storage medium
CN108933914A (en) * 2017-05-24 2018-12-04 中兴通讯股份有限公司 A kind of method and system carrying out video conference using mobile terminal

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6330022B1 (en) * 1998-11-05 2001-12-11 Lucent Technologies Inc. Digital processing apparatus and method to support video conferencing in variable contexts
JP3661768B2 (en) * 2000-10-04 2005-06-22 インターナショナル・ビジネス・マシーンズ・コーポレーション Audio equipment and computer equipment
US20100153497A1 (en) * 2008-12-12 2010-06-17 Nortel Networks Limited Sharing expression information among conference participants
US20130024018A1 (en) * 2011-07-22 2013-01-24 Htc Corporation Multimedia control method and multimedia control system
KR20140099976A (en) * 2013-02-04 2014-08-14 삼성전자주식회사 Method and system for transmitting wirelessly video in portable terminal
CN105991710A (en) * 2015-02-10 2016-10-05 黄金富知识产权咨询(深圳)有限公司 Brief-report synchronous display content method and corresponding system
WO2018120115A1 (en) * 2016-12-30 2018-07-05 Arris Enterprises Llc Method and apparatus for controlling set top box volume based on mobile device events
CN109246383B (en) * 2017-07-11 2022-03-29 中兴通讯股份有限公司 Control method of multimedia conference terminal and multimedia conference server
CN109068088A (en) * 2018-09-20 2018-12-21 明基智能科技(上海)有限公司 Meeting exchange method, apparatus and system based on user's portable terminal


Also Published As

Publication number Publication date
CN114697445A (en) 2022-07-01
WO2022143040A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
US11023690B2 (en) Customized output to optimize for user preference in a distributed system
US10743107B1 (en) Synchronization of audio signals from distributed devices
EP3963576B1 (en) Speaker attributed transcript generation
US11875796B2 (en) Audio-visual diarization to identify meeting attendees
US20210407516A1 (en) Processing Overlapping Speech from Distributed Devices
US10812921B1 (en) Audio stream processing for distributed device meeting
WO2021031308A1 (en) Audio processing method and device, and storage medium
WO2021244056A1 (en) Data processing method and apparatus, and readable medium
CN115482830B (en) Voice enhancement method and related equipment
WO2022253003A1 (en) Speech enhancement method and related device
CN114697445B (en) Volume adjusting method, terminal and readable storage medium
US11468895B2 (en) Distributed device meeting initiation
US20100266112A1 (en) Method and device relating to conferencing
CN109102813B (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
US11682412B2 (en) Information processing method, electronic equipment, and storage medium
CN114694685A (en) Voice quality evaluation method, device and storage medium
CN109102810B (en) Voiceprint recognition method and device
CN113707130A (en) Voice recognition method and device for voice recognition
CN115602162A (en) Voice recognition method and device, storage medium and electronic equipment
CN115297402A (en) Audio processing method, device and storage medium
CN112702672A (en) Adjusting method and device and earphone equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant