
Swiss Parliaments Corpus and Data Transformation Project
"Explore the Swiss Parliaments Corpus project focusing on converting Swiss German speech into Standard German text. Discover the data transformation challenges, processes, and results, including a corpus with 293 hours of training data available for download under the MIT license. Dive into the details of this fascinating linguistic endeavor."
Presentation Transcript
Swiss Parliaments Corpus. Michel Plüss, Lukas Neukom, Christian Scheller, Manfred Vogel. Institute for Data Science, FHNW.
The Task. Example: Swiss German: "Ide Abfahrt hetter de sächsti Platz beleit." Standard German: "In der Abfahrt belegte er den sechsten Platz." 15.06.2021
The Task - Challenges: no standardized writing system; spelling ambiguities; huge vocabulary size; missing text processing tools; dialect diversity; lack of public training data.
Data Source: Debates from Swiss parliaments that speak Swiss German, keep Standard German word-for-word minutes, and are willing to share the recordings for research purposes. First parliament: Grosser Rat Kanton Bern. Recordings and minutes are available on the website (https://www.gr.be.ch/gr/de/index/sessionen/sessionen.html). 460 hours of recordings (as of 22.07.2020). Recording length: 28-242 min.
Data Transformation: Automatic procedure. Inputs: recording and minutes of a parliament meeting. Output: sentence-level pairs of Swiss German speech and Standard German text. Example output: "Der Medienstandort Bern wird aber in letzter Zeit laufend geschwächt."
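Data Transformation (pipeline diagram): The Swiss German (CH) recording is passed through the Amazon Transcribe DE model, which yields a Standard German text of bad quality (T DE, bad quality). The minutes, a manual transcription, provide the ground-truth Standard German text (T DE, ground truth). A global alignment matches the two texts, and audio snippets are created using the resulting start and end times. Example from the diagram (timestamps 00:45 and 00:53): automatic transcript "damit über sie hi mehr selbstverständlich nachgefragt mehrheit hauses bewegung"; minutes "Da wir über 150 Personen sind, haben wir selbstverständlich nachgefragt. Wir haben jedenfalls die Bewilligung."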
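The global alignment step can be illustrated with a small word-level sketch. The slides do not say which alignment algorithm or scoring was used, so the Needleman-Wunsch scheme, the scores, the function names `global_align` and `sentence_span`, and the word timings below are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def _score(a: str, b: str, match: int, mismatch: int) -> int:
    return match if a.lower() == b.lower() else mismatch


def global_align(hyp: List[str], ref: List[str],
                 match: int = 1, mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Needleman-Wunsch word alignment (assumed scheme, not the authors' code).

    Returns (hyp_index, ref_index) pairs; None marks a gap on that side.
    """
    n, m = len(hyp), len(ref)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + _score(hyp[i - 1], ref[j - 1], match, mismatch),
                table[i - 1][j] + gap,   # hypothesis word left unaligned
                table[i][j - 1] + gap,   # reference word left unaligned
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == (
                table[i - 1][j - 1] + _score(hyp[i - 1], ref[j - 1], match, mismatch)):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def sentence_span(pairs: List[Pair], hyp: List[str], ref: List[str],
                  hyp_times: List[Tuple[float, float]],
                  ref_start: int, ref_end: int) -> Optional[Tuple[float, float]]:
    """Start/end time of reference words [ref_start, ref_end), taken from
    exactly matching hypothesis words. Returns None if nothing matched."""
    hits = [h for h, r in pairs
            if h is not None and r is not None
            and ref_start <= r < ref_end
            and hyp[h].lower() == ref[r].lower()]
    if not hits:
        return None
    return hyp_times[min(hits)][0], hyp_times[max(hits)][1]


# Hypothetical example: shortened automatic transcript with invented word timings.
hyp_words = ["damit", "über", "sie", "selbstverständlich", "nachgefragt"]
hyp_times = [(45.0, 45.4), (45.4, 45.8), (45.8, 46.0), (46.5, 47.5), (47.5, 48.2)]
ref_words = ["Da", "wir", "über", "150", "Personen", "sind",
             "haben", "wir", "selbstverständlich", "nachgefragt"]
alignment = global_align(hyp_words, ref_words)
span = sentence_span(alignment, hyp_words, ref_words, hyp_times, 0, len(ref_words))
```

Anchoring the snippet boundaries only on exactly matching words is one possible design choice; it ignores the many misrecognized words in the automatic transcript at the cost of slightly conservative boundaries.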
Data Transformation: Not all speech-text pairs are perfect. A per-pair alignment quality estimator is used to filter pairs based on estimated quality. This trades off corpus size against quality, and gives faster training time with the same WER / BLEU.
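The size-versus-quality trade-off can be sketched as a simple threshold filter. How the quality estimate is computed is not described in the slides, so the `quality` field, its range, and the threshold value here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechTextPair:
    audio_path: str
    text: str
    duration_s: float
    quality: float  # hypothetical estimated alignment quality in [0, 1]


def filter_pairs(pairs: List[SpeechTextPair], min_quality: float) -> List[SpeechTextPair]:
    """Keep only pairs whose estimated quality reaches the threshold."""
    return [p for p in pairs if p.quality >= min_quality]


def total_hours(pairs: List[SpeechTextPair]) -> float:
    """Total audio duration of the given pairs in hours."""
    return sum(p.duration_s for p in pairs) / 3600.0


# Invented example data: raising min_quality shrinks the corpus.
corpus = [
    SpeechTextPair("a.flac", "Erster Satz.", 1800.0, 0.9),
    SpeechTextPair("b.flac", "Zweiter Satz.", 3600.0, 0.5),
    SpeechTextPair("c.flac", "Dritter Satz.", 1800.0, 0.7),
]
kept = filter_pairs(corpus, min_quality=0.6)
```

Sweeping `min_quality` and recording `total_hours(kept)` at each value is one way to inspect the trade-off before choosing a cut-off.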
Results - Data: Swiss Parliaments Corpus with 293 hours of training data. MIT license, available for download: https://www.cs.technik.fhnw.ch/i4ds-datasets
Model: Deep neural network. End-to-end: one model directly converts Swiss German speech to Standard German text (speech translation). This avoids Swiss German text with all its problems, and one model handles all dialects. It requires sentence-level speech-text pairs as training data, and large amounts of them (hundreds to thousands of hours).
Model: Sequence-to-sequence / encoder-decoder model. Conformer architecture [1]. Hybrid Connectionist Temporal Classification (CTC) / attention approach [2]. Implemented using the ESPnet framework [3].
[1] Gulati et al. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. Proceedings of Interspeech.
[2] Watanabe et al. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing.
[3] Watanabe et al. 2018. ESPnet: End-to-end speech processing toolkit. Proceedings of Interspeech.
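For orientation, the hybrid CTC/attention approach of [2] trains the shared encoder with a weighted sum of the CTC loss and the attention decoder loss. The formula below is the general form from that paper; the weight used for this model is not given in the slides.

```latex
\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad 0 \le \lambda \le 1
```

With \lambda = 0 the model is trained as a pure attention-based encoder-decoder, and with \lambda = 1 as a pure CTC model.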
Results - Model
Dataset                                               Word Error Rate (%)
Swiss Parliaments Corpus Test Set (mostly BE)         27.8
SwissText 2021 Task 3 Test Set (all dialects)         41.5
SwissText 2021 Task 3 Test Set (BE)                   37.6
SwissText 2021 Task 3 Test Set (BL / BS)              46.5
SwissText 2021 Task 3 Test Set (SG)                   43.9
SwissText 2021 Task 3 Test Set (ZH)                   40.7
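As a reminder of what the metric in this table measures, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal per-sentence version; the function name is illustrative, and the exact text normalization and tooling used for the reported numbers are not stated in the slides.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("belegte" -> "belegt") and one deletion ("sechsten")
# over 8 reference words.
wer = word_error_rate(
    "In der Abfahrt belegte er den sechsten Platz",
    "In der Abfahrt belegt er den Platz",
)
```

For a whole test set, the usual convention is to sum the edit counts over all sentences and divide by the total number of reference words, rather than averaging per-sentence rates.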
Outlook: Additional parliament Stadtrat Bern processed: 700 hours of training data (unfiltered). 4 promising parliaments to be processed: AG, AR, BL, OW dialects. Additional data through dialektsammlung.ch. Further improve models, e.g. Swiss German Speech to Standard German Text shared task.