veganism.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
Veganism Social is a welcoming space on the internet for vegans to connect and engage with the broader decentralized social media community.

#commonvoice
Replied to BlaCHp

❤️‍🔥A very special programme!!❤️‍🔥 On Saturday 15 March 2025 at 13:30, meet our wonderful speaker @latenightdef for a talk on 🗺️📍Indoor mapping for OpenStreetMap using OsmInEdit and IndoorEqual📍🗺️ eventyay.com/e/4c0e0c27/sessio

Both the 🎙️Mozilla Common Voice 🎧 team and the speakers are delighted to share new experiences with everyone💞 Don't miss it - see you at ✨ FOSSASIA Summit 2025 ✨

Come and experience something special with us! ❤️‍🔥💞

🎙️ Announcement 🎧
We warmly invite everyone to join ✨ FOSSASIA Summit 2025 ✨ from Thursday 13 to Saturday 15 March 2025 at True Digital Park eventyay.com/e/4c0e0c27

The event features talks from speakers as well as booths from the 🌟 open source 🌟 community, and much more.

One of them is the ✨ Mozilla Common Voice ✨🎧🎙️ booth, showcasing technology for building publicly accessible voice datasets. The team is proud to present it and looks forward to joining in the activities with everyone😊💕

Has anyone here received this email from Mozilla regarding Common Voice?

>>quote
Mozilla has always fought for an open, accessible internet that puts people in control — no matter the obstacles. Today, we need to share a significant challenge: Mozilla Common Voice is at risk of losing $1.05 million in U.S. government funding due to Donald Trump and Elon Musk's interference with science and technology grants.

Mozilla Common Voice is the largest open, crowd-sourced speech recognition dataset, designed to make voice-enabled technology available to the world’s 7000 languages. This funding was meant to help expand our work over the next three years, letting us build features in response to community demand — like code-switching datasets and Indigenous language licensing. Now, we don’t know if we’ll receive any of that support.

But we’re not backing down. We’re adapting our roadmap, staying nimble, and finding new ways to sustain this work. And we can’t do it alone. That’s why we’re turning to you for support.

Last year, 100,558 people contributed to Mozilla. That's an incredible show of support. If you've been waiting for the right moment to donate, now is the time. Your contribution will help sustain work like Common Voice and advance an open and accessible internet for all.

Make a $10 USD contribution to Mozilla today to help build an internet that puts people first.

<< end quote

#Mozilla #commonvoice #firefox #Trump #funding #Musk

commonvoice.mozilla.org/en?for


It's been another big year as I work towards completing my #dissertation on voice dataset documentation and how it influences how well #speech technologies work for all voices at the #ANU School of Cybernetics - with big thanks to my supervisors, Elizabeth Williams, Alexandra Zafiroglu, Jofish Kaye and Paul Wong 黃仲熙.

I've wrapped up a partnership with Mozilla's #CommonVoice team, which let me explore the #dataset in a lot more detail - big thanks to EM Lewis-Jong, @jessie and Dmitrij Feller in particular.

It was an incredible honor to keynote #FF24 at the National Film and Sound Archive of Australia alongside Peter-Lucas Jones of Te Hiku Media, expertly facilitated by Keir Winesmith - thanks @ingridbmason and team for the opportunity. Stay tuned for a little project we are working on: we know you're all eager for the video of this keynote, but we're adding a little more magic.

I helped out with @everythingopen Media and Comms this year, and am looking forward to speaking in January in Adelaide.

A huge thanks to my fellow #PhD buddies - Lorenn Ruster, @nedcpr, Glen Berman, Tom Chan, Danny Bettay, Charlotte Bradley, @Amirasadi, Memunat Ajoke Ibrahim and the later cohorts for all your support, shut up and write sessions and intellectual growth.

When was the last time that you actually contributed to an open source project?

I'm certain you've heard of Common Voice at Mozilla.

In case you haven't: the languages that need more data are all of them. So even contributing 15 samples a day does a lot on the whole.

I had slacked off on my Common Voice contributions, but I'm now picking it up again

@RL_Dane

🖋️ #Mozilla #commonVoice #OpenSource #contribute #samples #programming

commonvoice.mozilla.org

The Mozilla #CommonVoice #dataset v20 was released yesterday - the largest open #speech dataset in the world. My #dataviz, linked below, shows a continuation of patterns seen for some years now:

➡️ There's more data collected for #Catalan (ca) than for #English (en) - testament to the independence and language reclamation efforts in Catalunya. Language and cultural transmission are deeply intertwined.

➡️ Some of the newer #languages to Common Voice, like #Ligurian / #Genoese (lij) have contributions from mostly older speakers, which is unusual in comparison to the rest of the dataset. This may reflect the population that currently speak those languages - as many regional languages in Italy are in rapid decline.

➡️ Some languages such as Eastern Mari / Meadow Mari (mhr) - a #Uralic language spoken in the Mari-El Republic within Russia - have samples from predominantly female-identifying speakers, again contrasting to the rest of the dataset. Other languages here include #Cantonese (yue), #Georgian (ka), and #Kalenjin (kln).

➡️ A key part of preparing the Common Voice dataset is validating utterances to ensure they match their written transcription - each clip requires at least two validations by separate speakers. Some newer languages to Common Voice, such as Erzya (myv) and Moksha (mdf), both Uralic languages, have nearly 100% validation.

What are your interpretations of the dataset?

observablehq.com/@kathyreid/mo
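On the validation point above, here's a minimal sketch of how a validation percentage could be computed. This is my own simplification for illustration, not Mozilla's actual implementation: I assume a clip counts as validated once it has at least two up-votes and more up-votes than down-votes from separate reviewers.

```python
# Simplified sketch (not Mozilla's real logic): a clip is "validated"
# once it has at least two up-votes and a net positive vote balance.

def is_validated(up_votes: int, down_votes: int, min_votes: int = 2) -> bool:
    """Return True if a clip has enough net positive validations."""
    return up_votes >= min_votes and up_votes > down_votes

def validation_rate(clips: list[tuple[int, int]]) -> float:
    """Fraction of clips that pass validation; clips are (up, down) vote pairs."""
    if not clips:
        return 0.0
    validated = sum(1 for up, down in clips if is_validated(up, down))
    return validated / len(clips)

# Toy example: 3 of 4 clips pass validation
clips = [(2, 0), (3, 1), (1, 0), (2, 0)]
print(round(validation_rate(clips), 2))  # → 0.75
```

A language sitting near 100% on this metric simply means almost every recorded clip has already been reviewed by enough separate speakers.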

If you're a #language nerd like I am, then you won't have missed the @mozilla #CommonVoice v19 #speech #dataset release - which now features 131 languages! Here's my #dataviz, done in @observablehq of the v19 #metadata coverage.

I've updated the visualisation this time around with human-readable language names instead of their ISO-639 or BCP-47 language codes, to make it easier to read.

There are some interesting observations:

▶ Catalan (ca) continues to be the leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. It's also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.

▶ Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers ♀ - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too!

▶ Sentence domains can now be categorised, and although most new sentences are "general", Albanian (sq) has a lot of sentences related to law and government.

▶ Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don't know enough about Tsonga to speculate why - it's a somewhat agglutinative language, but many Tsonga words are generally short.

▶ Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.

▶ The language with the highest number of average contributions per speaker is Taita (dav), a Bantu language from Kenya.

What do you make of the data visualisation? Are there any other insights you can see?

Big thanks to the CV team for all their efforts - EM, Jessica Rose, Dmitrij Feller and Justin Grant.

#linguistics

observablehq.com/@kathyreid/mo

Observable · Mozilla Common Voice v19 dataset metadata coverage: this visualisation uses "@d3/stacked-horizontal-bar-chart" to show the Common Voice metadata coverage, with the original data taken from the Common Voice `cv-dataset` repository.

Each quarter, when the new @mozilla #CommonVoice #dataset is released, I do a #dataviz using @observablehq of its #metadata coverage, across all 100+ languages, based on the JSON summary that is part of the release.
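As a rough sketch of the aggregation step behind that dataviz - note the JSON field names here (`locales`, `clips`, `validHrs`) are assumptions for illustration, not necessarily the real schema in the cv-dataset release summary:

```python
# Sketch of the kind of per-language aggregation behind the visualisation.
# The field names below are illustrative assumptions; check the actual
# cv-dataset release summary for the real schema.
import json

summary = json.loads("""
{
  "locales": {
    "ca":  {"clips": 2400000, "validHrs": 2900},
    "en":  {"clips": 2300000, "validHrs": 2600},
    "lij": {"clips": 12000,   "validHrs": 12}
  }
}
""")

# Rank languages by total recorded clips, largest first
ranked = sorted(summary["locales"].items(),
                key=lambda kv: kv[1]["clips"], reverse=True)
for code, stats in ranked:
    print(f"{code}: {stats['clips']:>9,} clips, {stats['validHrs']:,} validated hours")
```

From a table like this it's a short step to the stacked bar charts in the notebook, one bar per locale.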

Some of my observations from the v18 release are:

💡 #Catalan (ca) now has a larger dataset than English, based on the number of audio recordings (including validated and yet-to-be-validated recordings). It’s also an interesting dataset because the number of recordings per unique contributor is relatively low (around 80). This means it’s likely to have a high diversity of speakers in the dataset, which is useful for building #ASR models that generalise well to many speakers.

Catalan also appears to have the highest percentage of audio recordings by older speakers - e.g. speakers in their forties, fifties and older. Again, this highlights the diversity of speakers in the Catalan dataset.

💡 Although it’s very early to see any trends from the decision by Common Voice to expand the range of options for gender identity, we are starting to see some data being tagged with the new options that are available. For example, in #Uyghur (ug), we now have data tagged as “do not wish to say”. I don’t want to draw connections between the geopolitical situation in that area and the desire of data contributors not to provide demographic data which may in some way identify them without more evidence, but I think it’s telling that the first use of these expanded metadata categories appears in a language that is spoken in a contested geography.

💡 Similarly, it's very early to identify trends in sentence domain classification - most of the sentences that do have a domain tag are labelled "general", although "health_care" sentences occur frequently in languages such as #Albanian (sq).

💡 #Bangla (Bengali) (bn) continues to have a very large number of yet-to-be-validated audio recordings. Due to this, the train split for Bangla is quite small.

💡 #Dholuo (luo), a language spoken in Kenya and Tanzania, is an outlier in terms of the number of distinct data contributors to the dataset - this language has a very high average number of contributions per contributor. This is often seen in languages that are new to Common Voice, before they have been able to recruit more contributors. Dholuo has nearly 5 million speakers.

💡 The language with the highest average utterance duration is by far #Icelandic (is) at over 7 seconds. This may be because Icelandic has many words with several syllables, which take longer to pronounce. Consider "the cat sat on the mat" in English, cf "kötturinn sat á mottunni" in Icelandic.
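For anyone curious, that average utterance duration is just total recorded seconds divided by clip count. A toy illustration with made-up figures (not the real release numbers):

```python
# Hypothetical figures, purely to illustrate how "average utterance
# duration" is derived: total recorded seconds / number of clips.

durations = {
    # locale: (total_duration_seconds, clip_count) -- made-up values
    "is": (1_460_000, 200_000),    # Icelandic
    "en": (9_000_000, 1_800_000),  # English
}

avg = {loc: total / clips for loc, (total, clips) in durations.items()}
longest = max(avg, key=avg.get)
print(longest, round(avg[longest], 1))  # → is 7.3
```

The same division, run per locale over the release summary, is what surfaces outliers like Icelandic.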

Big thanks to all data contributors in this release for your donated utterances, and to Dmitrij Feller, @jessie, Gina Moape, EM Lewis-Jong and the team for all your efforts.

What are your thoughts? What conclusions do you draw?

observablehq.com/@kathyreid/mo

Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my #PhD supervisor, Associate Professor @eltwilliams, and written as part of my research at #ANU School of Cybernetics.

Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines both exploratory interviews and documentation analysis to characterise how large voice datasets - e.g. #LibriSpeech, @mozilla's #CommonVoice, and several others, document their #metadata.

Unsurprisingly, it finds that the #dataset #documentation practices seen currently do not meet the needs of the #ML practitioners who use these datasets.

We show, once again, in the words of Nithya Sambasivan - "everyone wants to do the model work, but nobody wants to do the data work" ...

aclanthology.org/2023.alta-1.6

#RightTheDocs #WriteTheDocs

Citation:

Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.

ACL AnthologyRight the docs: Characterising voice dataset documentation practices used in machine learningKathy Reid, Elizabeth T. Williams. Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. 2023.

For the past couple of years, as each new @mozilla #CommonVoice dataset of #voice #data is released, I've been using @observablehq to visualise the #metadata coverage across the 100+ languages in the dataset.

Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, @jessie, Gina Moape, Dmitrij Feller) and there's some super interesting insights from the visualisation:

➡ Catalan (ca) now has more data in Common Voice than English (en) (!)

➡ The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!

➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).

➡ Votic (vot) has the highest percentage of invalidated utterances, but with 76% of utterances invalidated, I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid) given the geopolitical instability in Russia currently.

See the visualisation here and let me know your thoughts below!

➡ observablehq.com/@kathyreid/mo

Observable · Mozilla Common Voice v17 dataset metadata coverage

Last week, as part of my #PhD program at the #ANU School of #cybernetics, I gave my final presentation, which is a summary of my methods and #research findings. I covered my interview work, the #dataset documentation analysis work I've been doing and my analysis work around #accents in @mozilla's #CommonVoice platform.

There were some insightful and thought-provoking questions from my panel and audience members, and of course - so many ideas for future research inquiry!

A huge thanks to my panel, chaired so well by Professor Alexandra Zafiroglu, to Dr Elizabeth Williams, my meticulous, methodical and always-encouraging Primary Supervisor, and to my co-supervisors Dr Jofish Kaye and Dr Paul Wong 黃仲熙 for their deep expertise in #HCI and #data respectively.

Similarly, a huge thank you to my #PhD cohort - Charlotte Bradley, Tom Chan, Danny Bettay and Sam Backwell - as well as the other cohorts in the School - for your encouragement and intellectual journeying.

I'm delighted to be presenting this paper, joint work with my doctoral supervisor, @eltwilliams, at the upcoming #EAAMO23 @ACM conference (presenting remotely to Boston from Australia - how good is hybrid?!)

#CommonVoice and #accent choice: data contributors describe their spoken accents in diverse ways

The paper reports on an analysis of accent data in #CommonVoice, and the ways in which data contributors self-describe their accents - a feature which has been available in the platform since 2022.

dl.acm.org/doi/10.1145/3617694

If you'd like to see the @observablehq code behind the #dataviz in the paper, you can access it here:

observablehq.com/@kathyreid/ph

Good morning everyone! Here's my latest #Connections #Introduction #Introductions #TwitterMigration post, where I curate interesting accounts for you to follow from across the #Fediverse :fediverse:

@maryrobinette is a #writer #author, and I am listening to her incredible #LadyAstronaut series at the moment. If you love #SciFi (esp hard scifi) you should read it, too! 🇺🇸

@sayashk is a #ComputerScience #PhD candidate at #Princeton, who is researching failures in #ML (he's also co-running a workshop on open #FoundationModels in about 15 hours, see my previous posts for more info) 🇺🇸

@michcampbell is Dr Micha Campbell and she is a #PalaeoClimate #PostDoc living on #Dharawal country 🇦🇺

@mthv is a #Research #Engineer who works in #GIS at #CNRS 🇫🇷

@astrolori is Lori and she is into #OpenSource, #fashion, #space and #tech #WomenInSTEM 🇨🇦

@pandas_dev is the official account for #pandas, the #Python #DataAnalysis tool 🐍 📊

@jessie is a lover of #languages and helps run #CommonVoice, @mozilla 's open #voice #data set, which now supports over 100 languages. She also teaches #WebDev and loves #hiking. She's awesome, you should follow her 🇬🇧

That's all for now, please do share your own lists so we can create deeper connections, and a tightly-connected community here

I'm reminded here of @maryrobinette's short story - "Red Rockets" - "She built something better than fireworks. She built community."

It's been a while since I did an #Introductions #Connections #Introduction #TwitterMigration post, where I curate a list of interesting people in the #Fediverse :fediverse: you might want to follow - helping us create valuable communities and connections.

@nrennie is a #lecturer #researcher in #health #DataScience at #LancasterUniversity, and she does amazing work in #DataViz, primarily with #RStats

@mkohler is a #SoftwareEngineer and #EngineeringManager, and a long-time contributor to all things @mozilla, and in particular, the #CommonVoice project

@isomeme is a #SoftwareEngineer too and she practices Hermetic #Magick

@skc is Scott Kingsley Clark, who is also a #SoftwareEngineer and Lead Dev of the #Pods framework for #WordPress

@aehdeschaine is interested in #libraries #archives #architecture and #PaleoGeography

@hclarke is a Senior #Research Fellow at #UniMelb. He researches #WildFire and #ClimateChange 🔥 🌲

@blogdiva is Liza, and she, well she's generally awesome and shares my views on Space Karen / ApartheidBoi 🇵🇷

Let me take this opportunity to remind you that if you want voice assistants such as Siri, Alexa, OK Google, etc. to speak our language, one way to achieve that is to take part in @commonvoicecat from @mozilla

You can read out the sentences the application offers and record your voice, or you can validate other users' voice clips to check that they match the text on screen.

A lot of help and effort is needed, so keep at it!! 💪🏼💪🏼
#CommonVoice #CommonVoiceCAT