Welcome to the Berlin Shona Novel (BeShoNo) Corpus!

The Corpus

The Berlin Shona Novel Corpus is one outcome of the research project "Changing Patterns in the Shona Novel", conducted at Humboldt University Berlin between 2013 and 2016 and funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation). It consists of annotated extracts from three Shona novels from Zimbabwe: Pfumo reropa (1961) by Patrick Chakaipa, Ndiko kupindana kwamazuva (1975) by Charles Mungoshi and Mapenzi (1999) by Ignatius Mabasa. Approximately 40 percent of the total text of these novels was processed with “The Field Linguist's Toolbox”, a semi-automatic tool for morphological analysis provided free of charge by SIL. The software breaks the text into morphemes, to which glosses and part of speech tags are assigned.

Team Members

The principal investigator of the project was Flora Veit-Wild, professor emerita of African Literatures and Cultures at Humboldt University. The research team members were Katja Kellerer, Isabelle Nguyen, Tsitsi Nyoni of Great Zimbabwe University (May 2013 – February 2014), Dr. Aquilina Mawadza (March – April 2014) and Dr. Jacob Mapara of Chinhoyi University of Technology (April 2015 – June 2016). Tom Güldemann, professor of African languages at Humboldt University, acted as linguistic advisor to the project. A brief description of the project in English and German is available on the homepage of the Department of African Studies. During a three-month stay in Harare, the research team was joined by two linguists from the University of Zimbabwe, Dr. Francis Matambirofa and Dr. Zvinashe Mamvura. With their help, the parsing of the texts was completed.

Presentation of Findings and Future Projects

The establishment of the linguistic corpus went hand in hand with a literary analysis of the three Shona novels. Preliminary papers showing how the linguistic data could be used from a literary angle were presented at a workshop hosted by Chinhoyi University of Technology on February 23, 2016. The workshop was attended by a group of around 20 Zimbabwean experts from linguistics and literary studies, who provided critical input into the work done by the Humboldt research team.

The workshop attendees also discussed the possibility of incorporating the Berlin Shona Novel Corpus into the corpus work done at the Universities of Oslo and Zimbabwe within the framework of the ALLEX project. A future project would combine the detailed morphological analysis developed in Berlin with the vast amount of data compiled over many years in the ALLEX project (including newspaper articles and recordings of spoken language), potentially resulting in the first major morphologically annotated corpus of Shona.

The Data

The Berlin Shona Novel Corpus represents the first attempt to analyse and annotate prominent literary works in Shona with the help of the Field Linguist’s Toolbox. In its present form, the corpus still contains open questions and inconsistencies. However, as a substantial outcome of the Berlin research project, it is made available here for interested researchers.

Annotated texts are available for download, individually as well as in bundles (sorted by author; see the dash tiles at the top of this page). The Toolbox software is available free of charge from SIL. The primary levels of analysis are:

  • morpheme breaks (labelled 'mb') 
  • glosses ('ge')
  • part of speech tags ('ps')

The secondary levels are:

  • source language ('sc', for borrowings and code-switching/mixing) 
  • usage ('ue', indicating code-switching/mixing) 
  • discourse notes ('nd', for slang etc.) 

The latter pertain to 'modern' phenomena of code-switching, borrowing and slang and were used only where relevant. For this reason, they do not appear in the Pfumo reropa files, but they do appear in the extracts from Mapenzi and, to some extent, in those from Ndiko kupindana kwamazuva.

The text ('tx') and free translation ('ft') lines are not part of the parsing process. For more details on morphological segmentation and glossing conventions, see our "Conventions and Settings".
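For readers who want to process the annotated files outside Toolbox, the tiers listed above can be read with a few lines of Python. The following is only a rough sketch, not a tool provided by the project. It assumes that each tier is stored on a line starting with a backslash marker (for example \mb or \ge), as in Toolbox's standard text format, and that the morpheme, gloss and part of speech tokens can be matched by simple whitespace splitting. Toolbox itself aligns tiers by column position, so files with multi-word glosses would need a more careful reader. The sample record consists of placeholders and is not taken from the corpus; the function names are hypothetical.

```python
from collections import defaultdict

# Placeholder record, NOT taken from the corpus; it only mimics the tier layout.
SAMPLE = r"""
\ref demo.001
\tx WORD1 WORD2
\mb PFX- ROOT1 ROOT2 -SFX
\ge GLOSS1- GLOSS2 GLOSS3 -GLOSS4
\ps tag1- tag2 tag3 -tag4
\ft Placeholder free translation.
"""


def parse_record(text):
    """Collect the content of each backslash-marked tier in one record."""
    tiers = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything without a marker
        marker, _, content = line[1:].partition(" ")
        tiers[marker].append(content.strip())
    # A tier may wrap over several lines; join the pieces with a space.
    return {marker: " ".join(parts) for marker, parts in tiers.items()}


def morphemes(record):
    """Pair morpheme, gloss and part of speech tokens by position (assumption)."""
    mb = record.get("mb", "").split()
    ge = record.get("ge", "").split()
    ps = record.get("ps", "").split()
    if not (len(mb) == len(ge) == len(ps)):
        raise ValueError("tiers have different token counts")
    return list(zip(mb, ge, ps))


record = parse_record(SAMPLE)
for mb, ge, ps in morphemes(record):
    print(f"{mb}\t{ge}\t{ps}")
```

The length check in morphemes is deliberate: given the open questions and inconsistencies mentioned above, it seems safer to flag records whose tiers do not line up than to pair tokens silently.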

In addition, digital versions of the complete original texts of the novels, which formed the basis of the Toolbox project, are provided. English translations of the works, as yet unpublished, are also available. Due to copyright issues, these files can be accessed as previews only. Print copies are available at the library of the Institute for Asian and African Studies at Humboldt University Berlin.

Please use the following information when citing the corpus in academic publications or conference papers:
Isabelle Nguyen, Tom Güldemann, Katja Kellerer, Zvinashe Mamvura, Francis Matambirofa, Aquilina Mawadza, Tsitsi Nyoni, and Flora Veit-Wild. 2016. Berlin Shona Novel (BeShoNo) Corpus. Berlin: Humboldt Universität zu Berlin. (Available online at https://rs.cms.hu-berlin.de/beshono/pages/home.php, accessed on XXXX-XX-XX)

Questions regarding the project or the data should be directed to Flora Veit-Wild at flora.veit-wild@rz.hu-berlin.de or Tom Güldemann at tom.gueldemann@rz.hu-berlin.de.

Work on this project was funded by the Deutsche Forschungsgemeinschaft (DFG).