Call 1

The following projects have been accepted in Call 1:

Acronym	Name
AAM-LR	Automatic Annotation of Multi-modal Language Resources
Adelheid	A Distributed Lemmatizer for Historical Dutch
ADEPT	Assaying Differences via Edit-Distance of Pronunciation Transcriptions
DUELME-LMF	Converting DUELME into LMF format
INTER-VIEWs	Curation of Interview Data
MIMORE	Microcomparative Morphosyntax Research Tool
SignLinC	Linking lexical databases and annotated corpora of signed languages
TICCLops	Text-Induced Corpus Clean-up online processing system
TDS Curator	A web-services architecture to curate the Typological Database System
TQE	Transcription Quality Evaluation
WFT-GTB	Integrating the Wurdboek fan 'e Fryske Taal into the Geïntegreerde Taalbank

AAM-LR - Automatic Annotation of Multi-modal Language Resources

Project coordinator: Prof. Dr. L. Boves (Radboud University Nijmegen)

Abstract:
The AAM-LR project aims at building a demonstrator of a web service that will help field researchers to annotate audio and video recordings. At the top level the service will mark the time intervals at which specific persons in the recording are speaking. In addition, the service will provide a global phonetic annotation, using language-independent phone models and phonetic features. Speech will be separated from speaker noises such as laughing. The output of the web service will be fed into the ELAN/ANNEX editor, to facilitate further manual annotation. Integration in the ELAN/ANNEX framework will reduce the need for extensive user interface development. The annotations will conform to ISOCat and potential new categories will be added to ISOCat. The AAM-LR project is completely based on open domain software.

Adelheid - A Distributed Lemmatizer for Historical Dutch

Project coordinator: Dr. H. van Halteren (Radboud University Nijmegen)
Project plan

Abstract:
This project aims at providing a web application with which an end user can have historical Dutch text tokenized, lemmatized and part-of-speech tagged, using the most appropriate resources (such as lexica) for the text in question. The need to consistently use appropriate resources leads to the intuitively obvious strategy of placing this service in the CLARIN infrastructure. For each specific text, the user can then select the best resources from those available in CLARIN, wherever they might reside, supplemented where necessary by the user's own lexica. During the project a demonstrator for the distributed automatic lemmatization will be created, with some 14th-century charters as test texts, together with corresponding resources.

ADEPT - Assaying Differences via Edit-Distance of Pronunciation Transcriptions

Project coordinator: Dr. Charlotte Gooskens (University of Groningen)

Abstract:
The goal of the project is to provide a web application capable of measuring the differences between sets of phonetic (or phonemic) transcriptions via edit distance. The C-based software on which the application is to be based has existed since about 2001, has been freely distributed since 2004, and is used modestly but profitably outside of Groningen (about 20 users). But it is too complex for many potential users, especially dialectologists and second-language learning specialists. The goal of the proposed work is to develop and implement a graphical user interface and to make the string comparison facility available as a web application. This should enable wider experimentation with the techniques. We will consult and collaborate with specialists for the expertise needed in establishing the application, to ensure that it conforms to emerging standards, and also to attempt to use the needs of the application as a test case for formulating more general infrastructure requirements for CLARIN and CLARIN-NL.
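The comparison described in this abstract can be illustrated with a minimal sketch. It assumes plain Levenshtein distance with unit costs over lists of phonetic segments and a simple average over paired transcriptions; the function names `edit_distance` and `mean_distance` are hypothetical, and the project's actual C-based software may weight operations differently.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences of phonetic segments.

    Assumption: insertions, deletions and substitutions all cost 1.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_distance(transcriptions_a, transcriptions_b):
    """Average edit distance over paired transcriptions of the same words."""
    pairs = list(zip(transcriptions_a, transcriptions_b))
    if not pairs:
        raise ValueError("need at least one pair of transcriptions")
    return sum(edit_distance(x, y) for x, y in pairs) / len(pairs)


# Two hypothetical varieties, each transcribing the same two words.
variety_1 = [["m", "ɛ", "l", "k"], ["h", "u", "s"]]
variety_2 = [["m", "ɛ", "l", "ə", "k"], ["h", "y", "s"]]
print(mean_distance(variety_1, variety_2))
```

Representing each transcription as a list of segments rather than a raw string keeps multi-character symbols (for example, a vowel plus a length mark) together as one unit of comparison.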

DUELME-LMF - Converting DUELME into LMF format

Project coordinator: Prof. dr. J.E.J.M. Odijk (Utrecht University)
Project plan

Abstract:
The goal of the project is to develop a converter from DUELME into LMF format, and vice versa (for a subset of lexicons in LMF format), and to apply the first converter to create a curated DUELME resource fully compliant with standards supported by CLARIN. A mapping will be defined between DUELME-specific data categories and (possibly newly created) ISOCAT data categories to ensure semantic interoperability of the curated resource with other resources and tools. A document will be produced in which limitations of and desiderata for the LMF standard, ISOCAT and other infrastructural aspects will be described.

INTER-VIEWs - Curation of Interview Data

Project coordinator: Dr. H. van den Heuvel (Radboud University Nijmegen)

Abstract:
The INTER-VIEWs project will make a corpus of interview data available to the community of researchers in the humanities. No dedicated infrastructure exists for this type of research data, and there is a lack of standardized metadata and of coordinated disclosure for research. The project will address these issues and provide solutions. The interviews that are made available in the INTER-VIEWs project originate from the Dutch Veterans Institute and are considered representative of other collections of interview data. The building blocks that are created, combined and optimized are durable storage, persistent identification, search and harvesting mechanisms, annotation tools, and privacy protection mechanisms. The aim of the project is to deliver generic, standardized solutions for the optimal scientific use of interview data. Thus, the project will enable researchers to review and share each other's views on the data; therefore, we call the project INTER-VIEWs. The building blocks of the infrastructure are based on results created in related previous projects. The corpus of interview data will be curated by DANS, acting as a CLARIN A/B Centre.

MIMORE - Microcomparative Morphosyntax Research Tool

Project coordinator: Prof. dr. Sjef Barbiers (Meertens Institute and Utrecht University)
Project plan

Abstract:
The demonstrator tool MIMORE will be based on three databases: (i) DynaSAND, a corpus of elicited speech and text collected between 2000 and 2005 to chart the syntactic variation at the clausal level in 267 dialects of Dutch spoken in the Netherlands, Belgium and North-West France; (ii) DiDDD, a corpus of elicited speech and text collected between 2005 and 2009 to chart the syntactic variation at the level of nominal groups in the same language area; (iii) MAND, a corpus of elicited speech and text collected between 1980 and 1995 to chart morphological (word-level) variation. In the proposed tool the three databases will have a common search engine that makes it possible to investigate potential correlations between variables at the three different linguistic levels, cartographic functionality enabling the user to visualize these correlations, and statistical functionality to analyze them.

SignLinC - Linking lexical databases and annotated corpora of signed languages

Project coordinator: Dr. O.A. Crasborn (Radboud University Nijmegen)
Project plan

Abstract:
This project aims to link two independently evolved data sets for a signed language: the Corpus NGT and the lexical database of the Dutch Sign Centre. The former is a corpus of video texts that is already fully compliant with CLARIN standards, while the latter is an independently evolved Microsoft SQL database. Both are prototypical for the situation of signed language resources in the world: corpora of running signing have only recently been under development and typically use ELAN as the annotation tool, while lexical databases have typically evolved as standalone applications to produce dictionaries in books or on CD/DVD-ROMs. In order to establish the link in both directions, a conversion of the lexical database to the CLARIN LMF standard for lexica is planned in the project. Further, enhancements to the existing CLARIN tools ELAN and LEXUS will be created to start the exchange between the two domains.

TICCLops - Text-Induced Corpus Clean-up online processing system

Project coordinator: Martin Reynaert (Tilburg University)
Project plan

Abstract:
We propose a demonstration project which will allow CLARIN users to submit their corpora for fully automatic spelling correction and normalization by TICCLops, the online processing version of our core component TICCL. This system should be widely applicable in all manner of curation projects and lexicographical work.

TDS Curator - A web-services architecture to curate the Typological Database System

Project coordinator: Dr. Alexis Dimitriadis (Utrecht University)

Abstract:
The Typological Database System (TDS) currently provides integrated access to multiple independently developed typological databases through a common web interface. While the system is currently operational, TDS Curator will make the TDS into a sustainable service that conforms to CLARIN infrastructural requirements. The system consists of an off-line back end that imports data and organizes it into an integrated system, and a real-time front end that manages user interaction and queries. At present, user access to the system is only possible through the interactive web interface; by implementing a web services architecture to properly separate the user interface from the data access layer, it will be possible for the latter to be queried by other components of the integrated environment envisaged by CLARIN. In this way, the data contained in the TDS can be transparently available to other tools and resources; the user interface itself would also be usable with other compatible data resources. TDS Curator will be transferred from the original developers to be hosted by DANS as an empirical test case for DANS serving as a Clarin A/B centre.

TQE - Transcription Quality Evaluation

Project coordinator: Dr. H. Strik (Radboud University Nijmegen)
Project plan

Abstract:
The current proposal is about a completely automatic Transcription Quality Evaluation (TQE) tool. Input is a corpus with audio files and phone transcriptions (PTs). Audio and PTs are aligned, phone boundaries are derived, and for each segment-phone combination it is determined how well they fit together, i.e. for each phone a TQE measure (a confidence measure) is determined, e.g. ranging from 0 to 100%, indicating how good the fit is and hence what the quality of the phone transcription is. The output of the TQE tool will consist of a TQE measure and the segment boundaries for each phone in the corpus. The tool will be useful for validating, obtaining, and selecting phone transcriptions, for detecting phone strings (e.g. words) with deviating pronunciation, and, in general, it can be usefully applied in all research - in various (sub-)fields of humanities and language and speech technology (L&ST) - in which audio and PTs are involved.

WFT-GTB - Integrating the Wurdboek fan 'e Fryske Taal into the Geïntegreerde Taalbank

Project coordinator: drs. Hindrik Sijens (Fryske Akademy)

Abstract:
The project covers the data curation of the Wurdboek fan 'e Fryske Taal (WFT) database and the demonstration of the data in the Geïntegreerde Taalbank (GTB) dictionary web application. The original dictionary data are stored in a BRS database with almost no metadata added.
In this project the original data will be transferred into XML with TEI annotation. To achieve this, preliminary work has to be done (additional work on the WFT data, fixing mistakes and inconsistencies), a set of metadata has to be defined, and a dictionary entry parser to convert the data to XML has to be developed.
Before the TEI-annotated WFT can be demonstrated in the GTB, a list of word classes and an index of sources and references have to be compiled. In order to implement the WFT data into the GTB application, additional adaptations have to be made.
The TEI encoding scheme used will be fully documented. The back end (search layer) for retrieval in scholarly dictionaries will be fully documented as a web service. This will contribute to the establishment of a CLARIN standard for encoding of and retrieval in scholarly dictionaries as a linguistic resource.