LOT Summerschool Course: CLARIN for Linguists

Monday, June 23, 2014 - 09:00 to Friday, June 27, 2014 - 11:00

Contact

Jan Odijk

Trans 10

3512 JK Utrecht

j.odijk@uu.nl

http://www.uu.nl/staff/JEJMOdijk/0

Course level

Introductory

Course description

CLARIN is a research infrastructure for humanities researchers who work or want to work with digital language data. This European infrastructure is currently being constructed. Many components and services in the CLARIN infrastructure can already be used. This course will introduce the CLARIN infrastructure, and some components and services contained in it that are relevant to linguists.

In a nutshell, the CLARIN infrastructure aims to achieve the following:

A researcher can easily find all data, tools and services relevant to his/her research
A researcher can apply tools and services to data without any technical background, or without any ad-hoc adaptations to the tools or data
A researcher can store data and tools resulting from his/her research in the CLARIN infrastructure, so that they become available to other researchers

The course will cover specific services and applications in the CLARIN infrastructure that contribute to these goals, with a focus on contributions made by the Netherlands CLARIN-NL project. Some sessions will consist of a lecture followed by a hands-on session in which participants learn to work with CLARIN tools.

The day-to-day program is as follows:

Monday

Introduction – Jan Odijk, Utrecht University

Part 1: Introduction to CLARIN and its context; searching for data with CLARIN (Virtual Language Observatory, Metadata Search); overview of the whole course.
Part 2: I will use a concrete linguistic research question to illustrate how CLARIN and data and tools in CLARIN can be used to improve the empirical base for linguistic research;
Presentations: Introduction [pdf]; Portal & Search for data [pdf]; Search in data [pdf]

Tuesday

GrETEL – Frank Van Eynde & Liesbeth Augustinus

This session consists of two talks and a demo. The first talk describes a method to investigate syntactically annotated corpora by using linguistic examples instead of a formal query language. The second talk describes a linguistic case study using information which is only available in syntactically annotated corpora.
- Treebank Mining By Example (by Liesbeth Augustinus). Presentations: GrETEL Handson
  - The recent construction of large linguistic treebanks for spoken and written Dutch (e.g. CGN, LASSY, SONAR) has created new and exciting opportunities for the empirical investigation of Dutch syntax. We will demonstrate how the GrETEL online tool allows users to consult treebanks in a linguist-friendly way. Instead of a formal search expression, a natural language example is used as input to the system, allowing users to search for constructions similar to the example they provide. We will walk through several examples, showing what type of queries can be applied.
- Number agreement in copular constructions. A treebank-grounded investigation (by Frank Van Eynde). Presentation: NumberAgreement
  - Predicate nominals canonically show number agreement with their target, as in

(1) My brother is a plumber/*plumbers.
(2) My brothers are plumbers/*a plumber.

Mismatches, however, are not excluded:

(3) He is friends with Elio di Rupo.
(4) These hooligans are a danger on the road.

The challenge is to model the agreement in a way that excludes the starred combinations in (1-2) but that allows the combinations in (3-4). A first step is to find out under which conditions the mismatches are allowed. I will show in the talk how the CGN and LASSY treebanks can be used for that purpose.

- Demo
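
As a rough illustration of the kind of search the second talk relies on, the sketch below scans a toy treebank for copular clauses whose subject and predicate nominal differ in number, as in examples (3) and (4). The XML layout, the attribute names (`rel`, `num`, `word`), the relation labels and the three clauses are simplified assumptions made for this example. They are not the actual CGN or LASSY format, and this is not how GrETEL itself is implemented.

```python
import xml.etree.ElementTree as ET

# Toy treebank: hypothetical, simplified annotation (not the real CGN/LASSY schema).
TOY_TREEBANK = """
<treebank>
  <clause id="s1">
    <node rel="su" num="sg" word="brother"/>
    <node rel="hd" word="is"/>
    <node rel="predc" num="sg" word="plumber"/>
  </clause>
  <clause id="s2">
    <node rel="su" num="sg" word="he"/>
    <node rel="hd" word="is"/>
    <node rel="predc" num="pl" word="friends"/>
  </clause>
  <clause id="s3">
    <node rel="su" num="pl" word="hooligans"/>
    <node rel="hd" word="are"/>
    <node rel="predc" num="sg" word="danger"/>
  </clause>
</treebank>
"""


def find_number_mismatches(xml_text):
    """Return (clause id, subject word, predicate word) for every clause whose
    subject and predicate nominal carry different number values."""
    root = ET.fromstring(xml_text)
    mismatches = []
    for clause in root.iter("clause"):
        subject = clause.find("node[@rel='su']")
        predicate = clause.find("node[@rel='predc']")
        if subject is None or predicate is None:
            continue  # not a copular clause in this toy encoding
        if subject.get("num") != predicate.get("num"):
            mismatches.append(
                (clause.get("id"), subject.get("word"), predicate.get("word"))
            )
    return mismatches


if __name__ == "__main__":
    for clause_id, subj, pred in find_number_mismatches(TOY_TREEBANK):
        print(clause_id, subj, pred)
```

On a real treebank the same idea would be expressed as a query over the corpus's own annotation scheme; the point here is only that a syntactic annotation layer makes such mismatches retrievable at all.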

Wednesday

TTNWW – Marc Kemps-Snijders. Presentation: [PDF]; POS-tagging + Named Entity Recognition workflow: [zip]

This session covers automatically enriching your data with grammatical information, such as part-of-speech tags, full syntactic structures and co-reference relations (by Marc Kemps-Snijders, Meertens Institute).

Thursday

Cornetto – Piek Vossen and Isa Maks (VU, Amsterdam). Presentation: [pdf]

In this lecture, we will explain the structure and usage of the Cornetto database for Dutch. This database consists of two collections: 1) lexical units with detailed information on specific word meanings, ranging from form, morphology, syntax, case frames and combinatorics to semantics and pragmatics, and 2) synsets built from synonymous lexical units and related through semantic relations. Through these two collections, Cornetto presents two different views on the organization of the language from a lexical point of view. The database is an important component for Dutch language technology, but also for more fundamental research on, for example, semantic similarity and relatedness, modeling of opinions, and text mining. During this lecture, we will use the online interface to Cornetto that was developed in CLARIN-NL: http://cornetto.inl.nl/cornetto/cornetto.html.
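
The two collections described above can be pictured as two linked record types. The sketch below is a deliberately minimal model: the class names, field names, identifiers and the example entries are hypothetical and do not reflect Cornetto's actual schema or content.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalUnit:
    """One specific meaning of a word form (hypothetical, reduced field set)."""
    lu_id: str
    form: str
    pos: str
    synset_id: str  # link to the synset this meaning belongs to


@dataclass
class Synset:
    """A set of synonymous lexical units plus relations to other synsets."""
    synset_id: str
    lu_ids: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)  # relation name -> synset ids


# Invented example entries, for illustration only.
lexical_units = {
    "lu1": LexicalUnit("lu1", "fiets", "noun", "syn1"),
    "lu2": LexicalUnit("lu2", "rijwiel", "noun", "syn1"),
    "lu3": LexicalUnit("lu3", "voertuig", "noun", "syn2"),
}
synsets = {
    "syn1": Synset("syn1", ["lu1", "lu2"], {"hypernym": ["syn2"]}),
    "syn2": Synset("syn2", ["lu3"]),
}


def synonyms(lu_id):
    """Forms that share a synset with the given lexical unit, excluding itself."""
    synset = synsets[lexical_units[lu_id].synset_id]
    return [lexical_units[other].form for other in synset.lu_ids if other != lu_id]


def hypernym_forms(lu_id):
    """Forms found in the synsets reached via the 'hypernym' relation."""
    synset = synsets[lexical_units[lu_id].synset_id]
    forms = []
    for target_id in synset.relations.get("hypernym", []):
        forms.extend(lexical_units[lu].form for lu in synsets[target_id].lu_ids)
    return forms
```

Keeping word-level detail on the lexical unit and semantic relations on the synset mirrors the two views mentioned in the description: one organized around individual word meanings, the other around concepts and the relations between them.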

Friday

CLARIN-compatibility and Wrap-up – Jan Odijk

How to make your data or tools CLARIN-compatible; how to store your data/tools in the CLARIN infrastructure; the role of CLARIN-centres, types of CLARIN-centres in the Netherlands. Presentation: [pdf]
Concluding Overview. Presentation: [pdf]

Reading materials

Course readings (in order of priority)

Lecture 1:
Hinrichs, E. & S. Krauwer (2014), ‘The CLARIN Research Infrastructure: Resources and Tools for eHumanities Scholars’, LREC 2014 Proceedings [pdf]
Odijk, J. (2014), ‘CLARIN-NL: Major Results’, LREC 2014 Proceedings [pdf]
Uytvanck, D. van, Stehouwer, H. and Lampen, L. (2012), "Semantic metadata mapping in practice: the Virtual Language Observatory". In Calzolari, N., Choukri, K., Declerck, T., Dogan, M.U., Maegaard B., Mariani J., Odijk, J. and Piperidis, S. (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), pp. 1029-1034. [pdf]

Lecture 2:

See the literature listed under Publications on http://nederbooms.ccl.kuleuven.be for publications on GrETEL. Most relevant for this course are:

Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde (2012). "Example-Based Treebank Querying" In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). Istanbul, Turkey. pp. 3161-3167. [pdf]
Frank Van Eynde (2014). Number agreement in copular constructions. A treebank-based investigation. [pdf]

Lecture 3: *to be added*

Lecture 4:
George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller, Introduction to WordNet: An On-line Lexical Database, http://wordnetcode.princeton.edu/5papers.pdf
Vossen, P., I. Maks, R. Segers, H. van der Vliet, M.F. Moens, K. Hofmann, E. Tjong Kim Sang, M. de Rijke. “Cornetto: a lexical semantic database for Dutch”, book chapter in: P. Spyns & J. Odijk (eds.): Essential Speech and Language Technology for Dutch, Results by the STEVIN-programme, Springer series Theory and Applications of Natural Language Processing, ISBN 978-3-642-30909-0, pp. 165-184
Postma, M., P. Vossen. “What implementation and translation teach us: the case of semantic similarity measures in wordnets”, in: Proceedings of the 7th Global WordNet Conference (GWC2014), Orav H., Vossen, P., Ch. Fellbaum (Eds.), ISBN 978-9949-32-492-7, pp. 133-142, January 25-29, 2014, Tartu, Estonia
Maks I., R. Izquierdo, F. Frontini, M. Cuadros, R. Agerri and P. Vossen. “Generating Polarity Lexicons with WordNet propagation in 5 languages”, in: Proceedings of the 9th Language Resources and Evaluation Conference (LREC2014), 26-31 May, 2014, Reykjavik, Iceland