SoNaR Corpus

The SoNaR Corpus 1.2.1 contains the final results of the STEVIN project SoNaR. The project has resulted in two datasets, viz. SoNaR-500 and SoNaR-1.

SoNaR-500 contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types. All texts have been tokenized, tagged for part of speech, and lemmatized, and the Named Entities in the same set have been labelled. All SoNaR-500 annotations were produced automatically; no manual verification took place.