On April 24 I attended the CLARIN ERIC workshop on Federated Content Search (FCS) at the Centre for Language Technology in Copenhagen. FCS aims to make it possible to search, with a single query, multiple datasets that may reside at different locations, without requiring detailed knowledge of the structure of each dataset (which may differ from one dataset to the next). This page sketches the architecture envisaged for FCS in CLARIN. FCS is a very hard problem, and CLARIN is working at the edge of scientific and engineering knowledge in this domain. It also imposes very strict requirements on formal and semantic interoperability. FCS functionality is nevertheless highly desirable in the CLARIN infrastructure, because it promises to make searching a wide variety of distributed data easy and user-friendly.
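To make the idea concrete, here is a minimal sketch of the fan-out pattern behind federated search, not the CLARIN architecture itself. Everything in it is an assumption made for illustration: the two in-memory "corpora", their field names, and the functions `search_a`, `search_b` and `federated_search` are hypothetical. Each dataset stores its text under a different field, and a small adapter per dataset hides that difference from the user.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    source: str  # name of the dataset the hit came from
    text: str    # matching text, in a common result format


# Hypothetical datasets: same kind of content, different structure.
CORPUS_A = [{"sentence": "The cat sleeps."}, {"sentence": "A dog barks."}]
CORPUS_B = [{"utterance": "the cat purrs"}, {"utterance": "birds sing"}]


def search_a(term: str) -> List[Hit]:
    """Adapter for corpus A, which keeps its text in the 'sentence' field."""
    return [Hit("corpus-a", r["sentence"])
            for r in CORPUS_A if term.lower() in r["sentence"].lower()]


def search_b(term: str) -> List[Hit]:
    """Adapter for corpus B, which keeps its text in the 'utterance' field."""
    return [Hit("corpus-b", r["utterance"])
            for r in CORPUS_B if term.lower() in r["utterance"].lower()]


# In a real setting these would be remote endpoints; here they are local functions.
ENDPOINTS: Dict[str, Callable[[str], List[Hit]]] = {
    "corpus-a": search_a,
    "corpus-b": search_b,
}


def federated_search(term: str) -> List[Hit]:
    """Send one query to every endpoint and merge the results."""
    hits: List[Hit] = []
    for search in ENDPOINTS.values():
        hits.extend(search(term))
    return hits


if __name__ == "__main__":
    for hit in federated_search("cat"):
        print(hit.source, "|", hit.text)
```

The hard part that this sketch leaves out is exactly what makes FCS difficult: agreeing on a query language and a result format that every adapter can support, which is where formal and semantic interoperability come in.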
The workshop was well organized and held in a pleasant environment, for which we thank the local CST committee and the workshop organizers. It was excellently moderated by Christoph Draxler from the Bavarian Archive for Speech Signals (BAS). There were technical presentations, presentations of example use cases, and a lot of fruitful discussion. It became clear that many of the use cases presented are too difficult for FCS at this stage (including, unfortunately, many of the ones I described), and that progress in this area will require a clearly planned, incremental approach.
One way to approach the problem is to clearly distinguish at least three categories of data (and the queries that go with them):
- corpora with relatively simple annotations of text (e.g. only annotations associated with tokens)
- treebanks
- lexicons
For example, for annotated text it would be ideal if FCS could handle the queries allowed by the Corpus Query Language (CQL). Unfortunately, at this stage this is more an ambitious goal than a reality.
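As an illustration of the simplest category, token-level annotations, the sketch below matches a sequence of attribute constraints against a list of annotated tokens. It is a toy under stated assumptions, not an implementation of the Corpus Query Language: the attribute names (`word`, `lemma`, `pos`), the tag set and the functions `token_matches` and `find_matches` are all hypothetical. The query corresponds roughly to what corpus query tools commonly write as `[pos="DET"] [pos="NOUN"]`.

```python
import re
from typing import Dict, List

Token = Dict[str, str]        # one annotated token, e.g. {"word": ..., "pos": ...}
Query = List[Dict[str, str]]  # one dict of attribute -> regex per token position


def token_matches(token: Token, constraint: Dict[str, str]) -> bool:
    """A token matches if every constrained attribute fully matches its regex."""
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in constraint.items()
    )


def find_matches(tokens: List[Token], query: Query) -> List[int]:
    """Return the start positions where the whole query sequence matches."""
    n = len(query)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(token_matches(tokens[i + j], query[j]) for j in range(n))
    ]


# Hypothetical annotated sentence: "The cats chased a mouse"
TOKENS: List[Token] = [
    {"word": "The", "lemma": "the", "pos": "DET"},
    {"word": "cats", "lemma": "cat", "pos": "NOUN"},
    {"word": "chased", "lemma": "chase", "pos": "VERB"},
    {"word": "a", "lemma": "a", "pos": "DET"},
    {"word": "mouse", "lemma": "mouse", "pos": "NOUN"},
]

# A determiner immediately followed by a noun.
QUERY: Query = [{"pos": "DET"}, {"pos": "NOUN"}]

if __name__ == "__main__":
    print(find_matches(TOKENS, QUERY))
```

Even this small example only works because both the query and the data use the same attribute names and tag set. Across independently built corpora that is rarely the case, which is one reason such queries remain out of reach for FCS for now.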
Research into and work towards FCS must continue, as must the work on making existing and new datasets formally and semantically interoperable. In the near future, however, FCS will offer only limited search functionality. CLARIN must therefore systematically offer, and continue to offer, alternatives in the form of resource-specific search engines. Hopefully, these can then gradually be replaced by FCS.