ISCA - International Speech Communication Association
ISCA Archive
Senior Data Scientist at the University of Chicago
Please apply at https://uchicago.wd5.myworkdayjobs.com/External/job/Chicago-IL/Sr-Data-Scientist_JR24587
About the Department
The TMW Center for Early Learning + Public Health (TMW Center) develops science-based interventions, tools, and technologies to help parents and caregivers interact with young children in ways that maximize brain development. A rich language environment is critical to healthy brain development; however, few tools exist to measure the quality or quantity of these environments. Access to this type of data allows caregivers to enhance interactions in real time and gives policy-makers insight into how best to build policies that have a population-level impact. The wearable team within the TMW Center is building a low-cost wearable device that can reliably and accurately measure a child’s early language environment vis-à-vis the conversational turns between a child and caregiver. The goal is to provide accurate, real-time feedback that empowers parents and caregivers to create the best language environment for their children.
Job Summary
The job works independently to perform a variety of activities relating to software support and/or development. Analyzes, designs, develops, debugs, and modifies computer code for end user applications, beta general releases, and production support. Guides development and implementation of applications, web pages, and user-interfaces using a variety of software applications, techniques, and tools. Solves complex problems in administration, maintenance, integration, and troubleshooting of code and application ecosystem currently in production. We are searching for a strategic and inquisitive senior data scientist to develop and optimize innovative AI-based models focused on speech/audio processing. The senior data scientist is expected to outline requirements, brainstorm ideas and solutions with leadership, manage data integrity and conduct experiments, assign tasks to junior staff, and monitor performance of the team.
Responsibilities
Minimum Qualifications
Education:
Minimum requirements include a college or university degree in a related field.
Work Experience:
Minimum requirements include knowledge and skills developed through 5-7 years of work experience in a related job discipline.
Certifications:
Preferred Qualifications
Experience:
Technical Skills or Knowledge:
Application Documents
When applying, the document(s) MUST be uploaded via the My Experience page, in the section titled Application Documents of the application.
Title: Predictive Modeling of Subjective Disagreement in Speech Annotation/Evaluation
Host laboratory: LIUM
Location: Le Mans
Supervisors: Meysam Shamsi, Anthony Larcher
Beginning of internship: February 2024
Application deadline: 10/01/2024
Keywords: Subjective Disagreement Modeling, Synthetic Speech Quality Evaluation, Speech Emotion Recognition

In the context of modeling subjective tasks, where diverse opinions, perceptions, and judgments exist among individuals, such as in speech quality assessment or speech emotion recognition, the challenge of defining ground truth and annotating a training set becomes crucial. The current practice of aggregating all annotations into a single label for modeling a subjective task is neither fair nor efficient [1]. The variability in annotations or evaluations can stem from various factors [2], broadly categorized into those associated with corpus quality and those intrinsic to the samples themselves. In the first case, the delicate definition of a subjective task introduces sensitivity into the annotation process, potentially leading to more errors, especially where the annotation tools and platform lack precision or annotators experience fatigue. In the second case, the inherent ambiguity in defining a subjective task and differing perceptions may result in varying annotations and disagreements. Developing a predictive model of annotator/evaluator disagreement is crucial for engaging in discussions about ambiguous samples and refining the definition of subjective concepts. Furthermore, such a model can serve as a valuable tool for assessing the confidence of automatic evaluations [3, 4]. This modeling approach will contribute to the automatic evaluation of corpus annotations, the identification of ambiguous samples for reconsideration or re-annotation, the automatic assessment of subjective models, and the detection of underrepresented samples and biases in the dataset. The proposed research involves using a speech dataset with multiple annotations per sample for a subjective task, such as MSP-Podcast [5], SOMOS [6], or VoiceMOS [7].
The primary objective is to predict the variation in assigned labels, measured through disagreement scores, entropy, or distribution.
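These target quantities can be computed directly from the raw per-sample annotations. A minimal sketch (the function names and the majority-vote definition of disagreement are our own, not prescribed by the proposal):

```python
from collections import Counter
import math

def label_entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution for one sample.

    0.0 means all annotators agree; higher values mean more disagreement."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def disagreement_rate(labels):
    """Fraction of annotations that differ from the majority label."""
    counts = Counter(labels)
    return 1 - counts.most_common(1)[0][1] / len(labels)
```

A predictive model would then regress such scores (or the full label distribution) from the speech signal itself.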
References:
[1]. Davani, A. M., Díaz, M., & Prabhakaran, V. (2022). Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10, 92-110.
[2]. Kreiman, J., Gerratt, B. R., & Ito, M. (2007). When and why listeners disagree in voice quality assessment tasks. The Journal of the Acoustical Society of America, 122(4), 2354-2364.
[3]. Wu, W., Chen, W., Zhang, C., & Woodland, P. C. (2023). It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation. arXiv preprint arXiv:2310.00486.
[4]. Han, J., Zhang, Z., Schmitt, M., Pantic, M., & Schuller, B. (2017, October). From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty. In Proceedings of the 25th ACM international conference on Multimedia (pp. 890-897).
[5]. Lotfian, R., & Busso, C. (2017). Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4), 471-483.
[6]. Maniati, G., Vioni, A., Ellinas, N., Nikitaras, K., Klapsas, K., Sung, J.S., Jho, G., Chalamandaris, A., Tsiakoulis, P. (2022). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis. Proc. Interspeech 2022, 2388-2392.
[7]. Cooper, E., Huang, W. C., Tsao, Y., Wang, H. M., Toda, T., & Yamagishi, J. (2023). The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains. arXiv preprint arXiv:2310.02640.
Applicant profile: Candidates should be motivated by artificial intelligence and enrolled in a Master's degree in Computer Science or a related field.
For application: Send a CV and cover letter to meysam.shamsi@univ-lemans.fr or anthony.larcher@univ-lemans.fr before 10/01/2024.
ANR Project «REVITALISE»
Automatic speech analysis of public talks.
Description. Today, extremely important aspects of human interaction (such as information exchange) depend not only on so-called hard skills but also on soft skills. One such important skill is public speaking. Like many forms of interaction between people, the assessment of public speaking depends on many factors, often subjectively perceived. The goal of our project is to create an automatic system which can take these different factors into account and evaluate the quality of a performance. This requires understanding which elements can be assessed objectively and which vary depending on the listener [Hemamou, Wortwein, Chollet21]. Such an analysis requires examining public speaking at various levels: high-level (audio, video, text), intermediate (voice monotony, auto-gestures, speech structure, etc.) and low-level (fundamental frequency, action units, POS tags, etc.) [Barkar]. This internship offers an opportunity to analyze the audio component of a public speech. The student is asked to solve two main problems. The engineering task is to create an automatic speech transcription system that detects speech disfluencies; to do this, the student will collect a bibliography on the topic and design an engineering solution. The second, research task is to use audio cues to automatically analyze how successful a talk is. This internship will give you the opportunity to solve an engineering problem as well as to learn more about research approaches. By the end you will have expertise in audio processing as well as in machine learning methods for multimodal analysis. If the internship is successfully completed, an article may be published. PhD funding on Social Computing will be available in the team (at INRIA) at the end of the internship.
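As an illustration of the kind of output the engineering task could produce, here is a minimal, purely lexical disfluency tagger. The filler list and function names are our own; a real system would operate on token-level ASR output and be informed by the bibliography below:

```python
import re

# Illustrative English filler markers (our own choice, for demonstration only).
FILLERS = {"uh", "um", "er", "hmm"}

def tag_disfluencies(transcript: str):
    """Return (clean_text, disfluencies): a fluent version of the transcript
    and a list of (position, word, kind) for fillers and word repetitions."""
    words = re.findall(r"[a-z']+", transcript.lower())
    disfluencies, clean = [], []
    prev = None
    for i, w in enumerate(words):
        if w in FILLERS:
            disfluencies.append((i, w, "filler"))
        elif w == prev:
            disfluencies.append((i, w, "repetition"))
        else:
            clean.append(w)
        prev = w
    return " ".join(clean), disfluencies
```

For example, `tag_disfluencies("Um, I I think it works")` flags the filler "um" and the repeated "I" while keeping the fluent words.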
Registration & Organisation.
Name of organization: Institut Polytechnique de Paris, Telecom-Paris
Website of organization: https://www.telecom-paris.fr
Department: IDS/LTCI
Address: Palaiseau, France
Supervision. Supervision will include weekly meetings with the main supervisor and regular meetings (every 2-3 weeks) with co-supervisors.
Name of supervisor: Alisa BARKAR
Names of co-supervisors: Chloe Clavel, Mathieu Chollet, Béatrice BIANCARDI
Contact details: alisa.barkar@telecom-paris.fr
Duration & Planning. The internship is planned as a 5-6 month full-time internship during the spring semester of 2024. The 6 months correspond to 24 weeks, covering the following activities:
● ACTIVITY 1(A1): Problem description and integration to the working environment
● ACTIVITY 2(A2): Bibliography overview
● ACTIVITY 3(A3): Implementation of automatic transcription with disfluency detection
● ACTIVITY 4(A4): Evaluation of the automatic transcription
● ACTIVITY 5(A5): Application of the developed methods to the existing data
● ACTIVITY 6(A6): Analysis of the importance of para-verbal features for the performance perception
● ACTIVITY 7(A7): Writing the report
Selected references of the team.
1. [Hemamou] L. Hemamou, G. Felhi, V. Vandenbussche, J.-C. Martin, C. Clavel, HireNet: a Hierarchical Attention Model for the Automatic Analysis of Asynchronous Video Job Interviews. in AAAI 2019, to appear
2. [Ben-Youssef] Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. Ue-hri: a new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 464–472. ACM, 2017.
3. [Wortwein] Torsten Wörtwein, Mathieu Chollet, Boris Schauerte, Louis-Philippe Morency, Rainer Stiefelhagen, and Stefan Scherer. 2015. Multimodal Public Speaking Performance Assessment. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15). Association for Computing Machinery, New York, NY, USA, 43–50.
4. [Chollet21] Chollet, M., Marsella, S., & Scherer, S. (2021). Training public speaking with virtual social interactions: effectiveness of real-time feedback and delayed feedback. Journal on Multimodal User Interfaces, 1-13.
5. [Barkar] Alisa Barkar, Mathieu Chollet, Beatrice Biancardi, and Chloe Clavel. 2023. Insights Into the Importance of Linguistic Textual Features on the Persuasiveness of Public Speaking. In Companion Publication of the 25th International Conference on Multimodal Interaction (ICMI '23 Companion). Association for Computing Machinery, New York, NY, USA, 51–55. https://doi.org/10.1145/3610661.3617161
Other references.
1. Dinkar, T., Vasilescu, I., Pelachaud, C. and Clavel, C., 2020, May. How confident are you? Exploring the role of fillers in the automatic prediction of a speaker’s confidence. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8104-8108). IEEE.
2. Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, Radford A. et al., 2022, url: https://arxiv.org/abs/2212.04356
3. Romana, Amrit and Kazuhito Koishida. “Toward A Multimodal Approach for Disfluency Detection and Categorization.” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023): 1-5.
4. Radhakrishnan, Srijith et al. “Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition.” ArXiv abs/2310.06434 (2023): n. pag.
5. Wu, Xiao-lan et al. “Explanations for Automatic Speech Recognition.” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023): 1-5.
6. Min, Zeping and Jinbo Wang. “Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study.” ArXiv abs/2307.06530 (2023): n. pag.
7. Ouhnini, Ahmed et al. “Towards an Automatic Speech-to-Text Transcription System: Amazigh Language.” International Journal of Advanced Computer Science and Applications (2023): n. pag.
8. Bigi, Brigitte. “SPPAS: a tool for the phonetic segmentations of Speech.” (2023).
9. Rekesh, Dima et al. “Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition.” ArXiv abs/2305.05084 (2023): n. pag.
10. Arisoy, Ebru et al. “Bidirectional recurrent neural network language models for automatic speech recognition.” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015): 5421-5425.
11. Padmanabhan, Jayashree and Melvin Johnson. “Machine Learning in Automatic Speech Recognition: A Survey.” IETE Technical Review 32 (2015): 240 - 251.
12. Berard, Alexandre et al. “End-to-End Automatic Speech Translation of Audiobooks.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018): 6224-6228.
13. Kheir, Yassine El et al. “Automatic Pronunciation Assessment - A Review.” ArXiv abs/2310.13974 (2023): n. pag.
Evaluation of text-to-speech systems in a noisy environment
Subject. Perceptual evaluation is crucial in many fields related to speech technology, including speech synthesis. It assesses synthesis quality subjectively by asking a listening panel [5] to rate synthesized speech stimuli [1, 2]. Recent work has produced artificial intelligence models [3, 4] that predict the subjective rating of a synthesized speech segment, thus avoiding the need for a panel test. The major problem with this kind of evaluation is the interpretation of the word “quality”. Some listeners may base their judgment on intrinsic characteristics of the speech (such as timbre, speaking rate or punctuation), while others may base it on characteristics of the audio signal (such as the presence or absence of distortion). The subjective evaluation of speech can therefore be biased by the listeners' interpretation of the instructions, and the artificial intelligence model mentioned above may consequently rely on biased measurements. The project aims to carry out exploratory work to evaluate speech synthesis quality in a more robust way than has been proposed so far. We start from the hypothesis that the quality of synthesized speech can be estimated through its detectability in a real environment: a signal synthesized to perfectly reproduce human speech should not be detectable in an everyday environment. Based on this hypothesis, we propose to set up a speech perception experiment in a noisy environment. Sound-field reproduction methods exist that make it possible to simulate an existing environment over headphones.
The advantage of these methods is that a recording of a real environment can be played over headphones while additional signals are added, as if they had been present in the recorded sound scene. This involves, first, an acoustic measurement campaign in noisy everyday environments (public transport, open-plan offices, cafeterias, etc.). Then, synthesized speech will be generated, taking the context of the recordings into account. It will also be relevant to vary the parameters of the synthesized speech while keeping the same semantics. The everyday recordings will then be mixed with the synthesized speech signals to evaluate the detection of the latter. We will use the percentage of times the synthesized speech is detected as a quality indicator. These detection percentages will then be compared with the predictions of the artificial intelligence model mentioned above. We will thus be able to conclude (1) whether the methods are equivalent or complementary, and (2) which parameter(s) of the synthesized speech lead to its detection in a noisy environment.
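The mixing step can be sketched as scaling the synthesized speech to a target level relative to the background before adding it. This is an illustrative function of our own (the actual experiment would use calibrated binaural reproduction):

```python
import numpy as np

def mix_at_snr(background: np.ndarray, speech: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `speech` so its power sits `snr_db` dB above the background's power,
    then add it to the background (mono signals, same length and sample rate)."""
    p_bg = np.mean(background ** 2)
    p_sp = np.mean(speech ** 2)
    gain = np.sqrt(p_bg / p_sp * 10 ** (snr_db / 10))
    return background + gain * speech
```

Sweeping `snr_db` then yields detection rates as a function of level, which can be compared with the model's predicted quality scores.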
Informations compl´ementaires:
• Encadrement: Le stage sera co-encadr´e par Aghilas Sini, maˆıtre de conf´erence au Laboratoire d’Informatique de l’Universit´e du Mans (aghilas.sini@univ-lemans.fr) et Thibault Vicente, maˆıtre de conf´erence au Laboratoire d’Acoustique de l’Universit´e du Mans (thibault.vicente@univ-lemans.fr)
• Niveau requis: Stage de M2 recherche
• P´eriode envisag´ee: 6 mois (F´evrier `a Juillet 2024)
• Lieu: Le Mans Universit´e
• mots-cl´es: parole synth´etis´ee, synth`ese sonore binaurale, test par jury
References
[1] Y.-Y. Chang. Evaluation of tts systems in intelligibility and comprehension tasks. In Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011), pages 64–78, 2011.
[2] J. Chevelu, D. Lolive, S. Le Maguer, and D. Guennec. Se concentrer sur les différences : une méthode d'évaluation subjective efficace pour la comparaison de systèmes de synthèse (Focus on differences: a subjective evaluation method to efficiently compare TTS systems). In Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 1: JEP, pages 137–145, 2016.
[3] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang. MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proc. Interspeech 2019, pages 1541–1545, 2019.
[4] G. Mittag and S. Möller. Deep learning based assessment of synthetic speech naturalness. arXiv preprint arXiv:2104.11673, 2021.
[5] M. Wester, C. Valentini-Botinhao, and G. E. Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In 16th Annual Conference of the International Speech Communication Association, 2015.
M2 Master Internship
Automatic Alsatian speech recognition
1 Supervisors
Name: Emmanuel Vincent
Team and lab: Multispeech team, Inria research center at Université de Lorraine, Nancy
Email: emmanuel.vincent@inria.fr
Name: Pascale Erhart
Team and lab: Language/s and Society team, LiLPa, Strasbourg
Email: pascale.erhart@unistra.fr
2 Motivation and context
This internship is part of the Inria COLaF project (Corpora and tools for the languages of France), whose objective is to develop and disseminate inclusive language corpora and technologies for regional languages (Alsatian, Breton, Corsican, Occitan, Picard, etc.), overseas languages, and non-territorial immigration languages of France. With few exceptions, these languages are largely ignored by language technology providers [1]. However, such technologies are key to the protection, promotion and teaching of these languages. Alsatian is the second most spoken regional language in France in terms of number of speakers, with 46% of Alsace residents saying they speak Alsatian fairly well or very well [2]. However, it remains an under-resourced language in terms of data and language technologies. Attempts at machine translation have been made, as has data collection [3].
3 Objectives
The objective of the internship is to design an automatic speech recognition system for Alsatian based on sound archives (radio, television, web, etc.). This raises two challenges: i) Alsatian is not a homogeneous language but a continuum of dialectal varieties which are not always written in a standardized way; ii) the textual transcription is often unavailable or differs from the words actually pronounced (transcription errors, subtitles, etc.). Solutions will be based on i) finding a suitable methodology for choosing and preparing data, and ii) designing an automatic speech recognition system using end-to-end neural networks, which can rely on adapting an existing multilingual system such as Whisper [4] in a self-supervised manner from a number of untranscribed recordings [5], in a supervised manner from a smaller number of transcribed recordings, or even from text-only data [6]. The work will be based on datasets collected by LiLPa and the COLaF project's engineers, which include the television shows Sunndi's Kater [7] and Kùmme Mit [8], whose dialogues are scripted, some radio broadcasts from the 1950s–1970s with their typescripts [9], as well as untranscribed radio broadcasts of France Bleu Elsass. Dictionaries of Alsatian such as the Wörterbuch der elsässischen Mundarten, which can be consulted via the Woerterbuchnetz portal [10], or phonetization initiatives [11] could also be exploited, for example using the Orthal spelling [12]. The internship opens the possibility of pursuing a PhD thesis funded by the COLaF project.
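Since Alsatian spelling is not standardized, evaluation deserves care: word error rate can be complemented by character-level scores that are less sensitive to spelling variants. A standard Levenshtein-based sketch (our own code, not part of the project's tooling):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters),
    computed with a single rolling row of the dynamic-programming table."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (r != h))    # substitution (free if equal)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words, normalized by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, more forgiving of dialectal spelling variation."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For a hypothetical reference/hypothesis pair differing by one word, `wer` gives 1/N while `cer` counts only the characters that actually differ.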
4 Bibliography
[1] DGLFLF, Rapport au Parlement sur la langue française 2023, https://www.culture.gouv.fr/Media/Presse/Rapport-au-Parlement-surla-langue-francaise-2023
[2] https://www.alsace.eu/media/5491/cea-rapport-esl-francais.pdf
[3] D. Bernhard, A-L Ligozat, M. Bras, F. Martin, M. Vergez-Couret, P. Erhart, J. Sibille, A. Todirascu, P. Boula de Mareüil, D. Huck, “Collecting and annotating corpora for three under-resourced languages of France: Methodological issues”, Language Documentation & Conservation, 2021, 15, pp.316-357.
[4] A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, “Robust speech recognition via large-scale weak supervision”, in 40th International Conference on Machine Learning, 2023, pp. 28492-28518.
[5] A. Bhatia, S. Sinha, S. Dingliwal, K. Gopalakrishnan, S. Bodapati, K. Kirchhoff, “Don't stop selfsupervision: Accent adaptation of speech representations via residual Adapters”, in Interspeech, 2023, pp. 3362-3366.
[6] N. San, M. Bartelds, B. Billings, E. de Falco, H. Feriza, J. Safri, W. Sahrozi, B. Foley, B. McDonnell, D. Jurafsky, “Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions”, in 6th Workshop on Computational Methods for Endangered Languages, 2023, pp. 1-6.
[7] https://www.france.tv/france-3/grand-est/sunndi-s-kater/
[8] https://www.france.tv/france-3/grand-est/kumme-mit/toutes-les-videos/
[9] https://www.ouvroir.fr/cpe/index.php?id=1511
[10] https://woerterbuchnetz.de/?sigle=ElsWB#0
[11] doi:10.5281/zenodo.1174213
[12] https://orthal.fr/
5 Profile
MSc in speech processing, natural language processing, computational linguistics, or computer science. Strong programming skills in Python/PyTorch. Knowledge of Alsatian and/or German is a plus, but in no way a prerequisite.
-- Post-doctoral research position - L3i - La Rochelle France
---------------------------------------------------------------------------------------------------------------------------
Title : Emotion detection by semantic analysis of the text in comics speech balloons
The L3i laboratory has one open post-doc position in computer science, in the specific field of natural language processing in the context of digitised documents.
Duration: 12 months (an extension of 12 months will be possible)
Position available from: as soon as possible
Salary: approximately 2100 € / month (net)
Place: L3i lab, University of La Rochelle, France
Specialty: Computer Science/ Document Analysis/ Natural Language Processing
Contact: Jean-Christophe BURIE (jcburie [at] univ-lr.fr) / Antoine Doucet (antoine.doucet [at] univ-lr.fr)
Position Description
The L3i is a research lab of the University of La Rochelle. La Rochelle, a city on the Atlantic coast in the south-west of France, is one of the most attractive and dynamic cities in France. The L3i has worked for several years on document analysis and has developed well-known expertise in 'bande dessinée', manga and comics analysis, indexing and understanding.
The work done by the post-doc will take place in the context of SAiL (Sequential Art Image Laboratory), a joint laboratory involving L3i and a private company. The objective is to create innovative tools to index and interact with digitised comics. The work will be done in a team of 10 researchers and engineers.
The team has developed different methods to extract and recognise the text of the speech balloons. The specific task of the recruited researcher will be to use Natural Language Processing strategies to analyse the text in order to identify emotions expressed by a character (reacting to the utterance of another speaking character) or caused by it (talking to another character). The datasets will be collections of comics in French and English.
Qualifications
Candidates must have a completed PhD and research experience in natural language processing. Knowledge of and experience with deep learning is also recommended.
General Qualifications
• Good programming skills mastering at least one programming language like Python, Java, C/C++
• Good teamwork skills
• Good writing skills and proficiency in written and spoken English or French
Applications
Candidates should send a CV and a motivation letter to jcburie [at] univ-lr.fr and antoine.doucet [at] univ-lr.fr.
Master 2 Internship Proposal
Advisors: Jules Cauzinille, Benoˆıt Favre, Arnaud Rey
November 2023
Deep transfer knowledge from speech to primate vocalizations
Keywords: Computational bioacoustics, deep learning, self-supervised learning, transfer knowledge, efficient fine-tuning, primate vocalizations
1 Context
This internship is part of a multidisciplinary research project aimed at bridging the gap between state-of-the-art deep learning methods developed for speech processing and computational bioacoustics. Computational bioacoustics is a relatively new research field which proposes to tackle the study of animal acoustic communication with computational approaches [Stowell, 2022]. Bioacousticians are showing increasing interest in the deep learning revolution embodied in transformer architectures and self-supervised pre-trained models, but much investigation remains to be carried out. We propose to test the viability of self-supervision and knowledge transfer as a bioacoustic tool by pre-training models on speech and using them for primate vocalization analysis.
2 Problem Statement
Speech-based models can reach convincing performance on primate-related tasks, including segmentation, individual identification and call type classification [Sarkar and Doss, 2023], as they do on many other downstream tasks (such as vocal emotion recognition [Wang et al., 2021]). We have tested publicly available models such as HuBERT [Hsu et al., 2021] and Wav2Vec2 [Schneider et al., 2019], two self-supervised speech-based architectures, on some of these tasks with gibbon vocalizations. Our method involves probing and traditional fine-tuning of these models.
To ensure true knowledge transfer from the pre-training speech datasets to the downstream classification tasks, the goal of this internship will be to implement efficient fine-tuning methods in a similar fashion. These methods make it possible to limit and control the amount of information lost in the fine-tuning process. Depending on the interests of the candidate, they may include prompt tuning [Lester et al., 2021], attention prompting [Gao et al., 2023], low-rank adaptation [Hu et al., 2021] or adversarial reprogramming [Elsayed et al., 2018]. The candidate will also be free to explore other methods relevant to the question at hand, either on gibbon data or on other species' datasets currently being collected.
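Low-rank adaptation, one of the candidate methods, freezes the pre-trained weights and trains only a small low-rank update. A NumPy sketch of the parameterization only (not a training loop; the dimensions are arbitrary and the names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 768, 768, 8

# Frozen pre-trained weight, e.g. one projection matrix of a speech transformer.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank update B @ A; B starts at zero so the adapted model
# initially behaves exactly like the frozen one.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x):
    """Forward pass through the frozen projection plus the low-rank update."""
    return x @ (W + B @ A).T

# Only A and B are trained: ~2*d*rank parameters instead of d*d.
trainable = A.size + B.size
assert trainable < W.size // 10
```

The appeal for knowledge transfer is precisely that the pre-trained speech representation is preserved intact, with the bioacoustic adaptation confined to the small `A`/`B` factors.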
3 Profile
The intern will propose and implement efficient fine-tuning solutions on an array of (preferably self-supervised) acoustic models pre-trained on speech or general sound, such as HuBERT, Wav2vec, WavLM, VGGish, etc. Exploring adversarial reprogramming of models pre-trained on other modalities (images, videos, etc.) could also be carried out. The work will be implemented using PyTorch. The candidate must have the following qualities:
• Excellent knowledge of deep learning methods
• Extensive experience with PyTorch models
• An interest in processing bioacoustic data
• An interest in reading and writing scientific papers, as well as some curiosity for research challenges
The internship will last 6 months at the LIS and LPC laboratories in Marseille during spring 2024.
The candidate will work in close collaboration with Jules Cauzinille as part of his thesis on “Self-supervised learning for primate vocalization analysis”. The candidate will also be in contact with the research community of the ILCB.
4 Contact
Please send a CV, transcripts and a letter of application to jules.cauzinille@lis-lab.fr, benoit.favre@lislab.fr, and arnaud.rey@cnrs.fr. Do not hesitate to contact us if you have any questions (or if you want to hear what our primates sound like).
Gamaleldin F. Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial reprogramming of neural networks, 2018.
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model, 2023.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP:1–1, 2021. doi: 10.1109/TASLP.2021.3122291.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021.
Eklavya Sarkar and Mathew Magimai Doss. Can Self-Supervised Neural Networks Pre-Trained on Human Speech distinguish Animal Callers?, May 2023. arXiv:2305.14035 [cs, eess].
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469, 2019. doi: 10.21437/Interspeech.2019-1873.
Dan Stowell. Computational bioacoustics with deep learning: a review and roadmap. 10:e13152, 2022. ISSN 2167-8359. doi: 10.7717/peerj.13152. URL https://peerj.com/articles/13152.
Yingzhi Wang, Abdelmoumene Boumadane, and Abdelwahab Heba. A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding. CoRR, abs/2111.02735, 2021. doi: 10.48550/arXiv.2111.02735
PhD Title: SUMMA-Sound: SUMMarization of Activities of daily living using Sound-based activity recognition
Partnership:
IMT Atlantique: Campus ☒ Brest ☐ Nantes ☐ Rennes
Laboratory: Lab-STICC
Doctoral school: ☒ SPIN ☐ 3MG
Funding: IMT Atlantique, co-tutelle with Instituto Superior Técnico
Context: IMT Atlantique, internationally recognised for the quality of its research, is a leading general engineering school under the aegis of the French Ministry of Industry and Digital Technology, ranked in the three main international rankings (THE, Shanghai, QS). Located on three campuses (Brest, Nantes and Rennes), IMT Atlantique aims to combine digital technology and energy to transform society and industry through training, research and innovation, and to be the leading French higher education and research institution in this field on an international scale. With 290 researchers and permanent lecturers, 1000 publications and €18M of contracts, it supervises 2300 students each year, and its training courses are based on cutting-edge research carried out within 6 joint research units: GEPEA, IRISA, LATIM, Lab-STICC, LS2N and SUBATECH. The proposed thesis is part of the research activities of the RAMBO team (Robot interaction, Ambient systems, Machine learning, Behaviour, Optimization), the Lab-STICC laboratory, and the Computer Science department of IMT Atlantique.

Scientific context: The objective of this thesis is to develop a method for collecting and summarizing domestic health-related data relevant for medical diagnosis, in a non-intrusive manner, using audio information. This research addresses the lack of practical tools for providing succinct, high-level information to medical staff on the evolution of the patients they follow for diagnostic purposes. It is based on the assumption that valuable diagnostic data can be collected by observing short- and long-term lifestyle changes and behavioural anomalies, and it relies on the latest advances in audio-based activity recognition, summarization of human activity, and health diagnosis. Research on health diagnosis in domestic environments has already explored a variety of sensors and modalities for gathering data on human health indicators [5].
Nevertheless, audio-based activity recognition stands out for its less intrusive nature. Employing state-of-the-art sound-based activity recognition models [2] to monitor domestic human activity, the thesis will investigate and develop methods for summarizing human activity [3] in human-understandable language, in order to produce data that is easily interpretable by the doctors who remotely monitor their patients [4]. This work continues and extends the research of the RAMBO team at IMT Atlantique on ambient systems that enable healthy ageing at home for elderly adults and dependent populations [1]. We expect this thesis to provide technology that can relieve the burden on gerontologists and elderly-care facilities and alleviate the caregiver shortage by offering automatic support for the task of monitoring elderly or handicapped people, enabling them to age at home while still being followed by medical specialists using automated means.

Expected contributions of the thesis

Scientific goals: (1) Determine the set of human activities relevant for health diagnosis; (2) Implement a state-of-the-art model for audio-based activity recognition and have its function validated by clinicians; (3) Develop a model for summarizing the evolution of human activity over time intervals of arbitrary duration (typically spanning from days to months, and possibly years).

Expected outcomes of the PhD: (1) A model for semantic summarization of human activity, based on sound recognition of activities of daily living; (2) A proof of concept for this model.

Candidate profile and required skills:
• Master's degree in Computer Science (or equivalent)
• Programming and software engineering skills (Python, Git, software architecture design)
• Data science skills
• Machine learning skills
• English speaking and writing skills

References:
[1] Damien Bouchabou.
“Human activity recognition in smart homes: tackling data variability using context-dependent deep learning, transfer learning and data synthesis”. PhD thesis, École nationale supérieure Mines-Télécom Atlantique, May 2022. URL: https://theses.hal.science/tel-03728064.
[2] Detection and Classification of Acoustic Scenes and Events (DCASE). URL: https://dcase.community/challenge2022/task-soundevent-detection-in-domestic-environments (visited on 07/01/2022).
[3] P. Durga et al. “When less is better: A summarization technique that enhances clinical effectiveness of data”. In: Proceedings of the 2018 International Conference on Digital Health, 2018, pp. 116–120.
[4] Akshay Jain et al. “Linguistic summarization of in-home sensor data”. In: Journal of Biomedical Informatics 96 (2019), p. 103240. ISSN: 1532-0464.
[5] Mostafa Haghi Kashani et al. “A systematic review of IoT in healthcare: Applications, techniques, and trends”. In: Journal of Network and Computer Applications 192 (2021), p. 103164.

Work plan: The thesis will be organised in the following steps: (1) definition of pertinent sounds and activities for health diagnosis; (2) hardware set-up; (3) dataset constitution; (4) activity recognition; (5) diarization of activities; (6) summarization; (7) validation in a real environment.

Application: To apply for this position, please send an email to mihai[dot]andries[at]imt-atlantique[dot]fr before 16 May 2023, with your curriculum vitae, a document with your academic results (if possible), and a few lines describing your motivation to pursue a PhD.

Additional information:
Application deadline: 16 May 2023
Start date: Fall 2023
Contract duration: 36 months
Location: Brest (France) and Lisbon (Portugal)
Contacts: Mihai ANDRIES (mihai[dot]andries[at]imt-atlantique.fr), Plinio Moreno (plinio[at]isr.tecnico.ulisboa.pt)
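To give a concrete flavour of the diarization and summarization steps described in the work plan above, here is a minimal, purely illustrative sketch in Python (the language the posting requires). It assumes a hypothetical upstream sound-based activity recognizer that emits (day, activity-label) events; the `daily_counts` and `summarize` helpers, the baseline dictionary, and the deviation threshold are all invented for illustration and are not part of the thesis proposal.

```python
from collections import Counter
from datetime import date

# Hypothetical detected events: (day, activity_label) pairs, as could be
# produced by an upstream sound-based activity recognition model (step 4).
Events = list[tuple[date, str]]


def daily_counts(events: Events) -> dict[date, Counter]:
    """Group detected activities per day (a toy stand-in for step 5)."""
    per_day: dict[date, Counter] = {}
    for day, activity in events:
        per_day.setdefault(day, Counter())[activity] += 1
    return per_day


def summarize(events: Events, baseline: dict[str, float],
              tolerance: float = 0.5) -> list[str]:
    """Emit short human-readable sentences flagging activities whose mean
    daily frequency deviates from a per-activity baseline (a toy stand-in
    for step 6, linguistic summarization)."""
    per_day = daily_counts(events)
    n_days = len(per_day) or 1
    mean = {a: sum(c[a] for c in per_day.values()) / n_days for a in baseline}
    lines = []
    for activity, expected in baseline.items():
        observed = mean.get(activity, 0.0)
        if observed < expected * (1 - tolerance):
            lines.append(f"{activity}: observed {observed:.1f}/day, "
                         f"well below the usual {expected:.1f}/day.")
        elif observed > expected * (1 + tolerance):
            lines.append(f"{activity}: observed {observed:.1f}/day, "
                         f"well above the usual {expected:.1f}/day.")
    return lines
```

A real system would replace the hand-set baseline with statistics learned over long observation windows, but the shape of the output (short natural-language deviation reports) matches the kind of succinct, doctor-readable summaries the thesis targets.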
We are happy to announce a new research position in the field of speech and text anonymization at the German Research Center for Artificial Intelligence (DFKI), Berlin, Germany. We are looking for a full-time researcher at the Researcher or Junior Researcher level, and offer a two-year contract with optional extension and a PhD perspective.
© Copyright 2024 - ISCA International Speech Communication Association - All rights reserved.