
10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, Portorož (Slovenia)

Held under the Honorary Patronage of His Excellency Mr. Borut Pahor, President of the Republic of Slovenia


Current List of LREC 2016 Shared LRs


After the conference, the set of LRs shared at LREC 2016 was manually checked, and a cleaned version of the list is now available. The list includes LRs that meet the following criteria:

  • LRs that are accessible (either uploaded by the participants or available through an external download URL that they provided)
  • LRs categorized as datasets only. A dataset can be one of the following:
    • Corpus
    • Language Resources/Technologies Infrastructure
    • Evaluation Data
    • Evaluation Package
    • Ontology
    • Lexicon
    • Treebank

 

Excluded LRs are:

  • Uploaded LRs whose content did not correspond to the description
  • LRs with no download URL provided, or whose URL is now a dead link
  • LRs categorized as tools or guidelines
  • LRs associated with rejected papers

 

We added a new field to the metadata: “Conditions of use”. The value entered here indicates specific conditions of use provided by the submitter (such as Attribution, Non-commercial use, Share Alike, etc.).


Shared-LRs @ LREC 2016

  • Name
    A Gold Standard for Scalar Adjectives
    Resource type
    Evaluation Data
    Size
    12 Adjective Scales
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Evaluation/Validation
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    This resource contains twelve adjective scales for use in the evaluation of automatic ordering algorithms. The words on these scales were gathered from informants on Mechanical Turk. The ordering was gathered through a second task placed on Mechanical Turk.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection
    Resource type
    Corpus
    Size
    2.5 GByte
    Languages
    English (eng) French (fra) Spanish (spa)
    Production status
    Newly created-finished
    Resource usage
    Information Extraction, Information Retrieval
    License
    Creative Commons Attribution-ShareAlike 4.0 International License
    Conditions of use
    Attribution, ShareAlike
    Description
    This dataset is made available for evaluation of cross-lingual similarity detection algorithms.
    The characteristics of the dataset are the following:
    - it is multilingual: French, English and Spanish;
    - it proposes cross-language alignment information at different granularities: document-level, sentence-level and chunk-level;
    - it is based on both parallel and comparable corpora;
    - it contains both human and machine translated text;
    - part of it has been altered (to make the cross-language similarity detection more complicated) while the rest remains without noise;
    - documents were written by multiple types of authors, from average writers to professionals.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Amazigh annotated corpus with POS tags
    Resource type
    Corpus
    Size
    20000 tokens
    Languages
    Amazigh
    Production status
    Newly created-finished
    Resource usage
    Corpus Creation/Annotation
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    The corpus consists of texts extracted from a variety of sources, totalling more than 20k tokens. It is annotated morphologically using the tag set introduced in (Outahajala et al., 2010).
    Four annotators were involved in this task, and annotation speed was between 80 and 120 tokens per hour. The inter-annotator agreement is 94.98%.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    An Amazigh lexicon with POS tags
    Resource type
    Lexicon
    Size
    8000 words
    Languages
    Amazigh
    Production status
    Newly created-finished
    Resource usage
    Parsing and Tagging
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    A lexicon of about 8k words with their POS tags.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    An English-Spanish Code-Switching Twitter Corpus for Multilingual Sentiment Analysis (EN-ES-CS)
    Resource type
    Corpus
    Size
    3062 sentences
    Languages
    English (eng) Spanish (spa) Spanglish
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    Lesser General Public License For Linguistic Resources (LGPLLR)
    Conditions of use
    <Not Specified>
    Description
    This is the first English-Spanish code-switching Twitter corpus containing sentiment labels. Each tweet was manually annotated by three annotators according to two different criteria: (1) the dual-score known as SentiStrength and (2) the de facto standard trinary scheme (positive, neutral and negative).
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Analogy questions involving rare words and proper nouns for vector space models of distributional semantics
    Resource type
    Evaluation Package
    Size
    178 KByte
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Distributional Semantics
    License
    Gnu
    Conditions of use
    Attribution
    Description
    This package provides three different evaluation sets for analogy tasks:
    - word2vec's analogy questions reconstructed with rare words,
    - two evaluation sets for proper nouns, one based on words and the other based on phrases.
    For proper nouns, the evaluation on phrases should be preferred where possible, since proper nouns (especially proper names) most often consist of more than one token.

    The files are in word2vec's question form, i.e., four words per line where the first two and the second two stand in the same relation to each other.

    Please cite this thesis, if you use the data sets:
    Schlechtweg, Dominik (2015). Exploiting co-reference in distributional semantics. Master thesis. University of Stuttgart.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Apertium RDF Graph
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    Spanish (spa) French (fra) Portuguese (por) Italian (ita) Catalan (cat) Galician, Basque, Occitan, Esperanto, Romanian, Asturian
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    cc-by
    Conditions of use
    Attribution
    Description
    Apertium RDF contains the RDF (Resource Description Framework) version of the Apertium bilingual dictionaries, which have been transformed into RDF and published on the Web following the Linked Data principles. The core linguistic data has been modelled using the lemon model and the translations between terms have been modelled using the lemon translation module.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    App Dialogue
    Resource type
    Corpus
    Size
    218 KByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Dialogue
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Dialogues between human users and a Wizard-of-Oz (WoZ) system for performing complex tasks spanning multiple mobile apps. These tasks were extracted from users' real-life smartphone usage.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Arabic Sentiment Lexicon
    Resource type
    Lexicon
    Size
    5328 entries
    Languages
    Egyptian Arabic (arz) Modern Standard Arabic
    Production status
    Existing-used
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    <Not Specified>
    Conditions of use
    Attribution, Research only
    Description
    This lexicon contains a total of 5328 unique entries. Of those, 428 are compound negative phrases representing commonly used idioms or expressions, 308 are compound positives, 3446 are single term negative words and 1104 are single term positive words. 2322 (44%) of the terms and expressions in the lexicon are Egyptian or colloquial and 3006 (56%) are Modern Standard Arabic. While most of the colloquial terms are Egyptian, a few terms from other dialects have found their way into the lexicon. Some terms that are English transliterations are also included in the lexicon, as are misspellings of some frequently used terms.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Arabic Single Speaker Speech Corpus
    Resource type
    Corpus
    Size
    4.5 hours
    Languages
    South Levantine Arabic (ajp)
    Production status
    Newly created-finished
    Resource usage
    Speech Synthesis
    License
    CreativeCommons NonCommercial ShareAlike
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    Recorded in a studio and edited by a sound engineer to normalise pause length, eliminate strong amplitude changes and normalise energy throughout.
    Automatically phonetically segmented and aligned using HTK, bootstrapping the HMM models by manually aligning about 10% of the corpus. This gave an accuracy of 82.5% within 20 milliseconds of true boundaries (tested on a separate part of the corpus).
    Stress marks are provided for vowels in stressed syllables. Only one stress type is covered.
    Break indices (see ToBI) were not annotated. It is assumed that the speech is continuous and stops only on pauses, which are considered a separate phoneme.
    The phoneme set used in annotating the corpus and the phonetiser used were a result of PhD work by Nawar Halabi 2014-2017.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Bengali Lemmatization Dataset
    Resource type
    Evaluation Data
    Size
    197.4 KByte
    Languages
    Bengali
    Production status
    Newly created-in progress
    Resource usage
    Morphological Analysis
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    We provide two datasets for lemmatization in Bengali. The training dataset is named "training.txt" and the test dataset is named "test.txt".

    Each line in the training and the test data consists of a context of 3 consecutive words and the gold lemma of the target surface word appropriate in that context. The format of each line is as follows.

    <preceding contextual word><TAB><the target surface word><TAB><the succeeding contextual word><TAB><the gold lemma of the target surface word appropriate for the context>

    In the training data, there are 19,159 training samples and in the test data, there are 2,126 test samples.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    CL-PL-09 corpus
    Resource type
    Corpus
    Size
    334.4 MByte
    Languages
    Dutch (nld) English (eng) French (fra) German (deu) Spanish (spa) Polish
    Production status
    Existing-used
    Resource usage
    Corpus Creation/Annotation
    License
    *License*
    Use of the corpus is granted by the NLEL and Webis, subject to the licensing terms
    inherited from JRC-Acquis:
    http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/doc/README_Acquis-Communautaire-corpus_JRC.html#Usage

    and the Wikimedia Foundation:
    http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License

    The original documents can be downloaded from http://wt.jrc.it/lt/Acquis/ and http://download.wikipedia.org/.
    Conditions of use
    <Not Specified>
    Description
    The corpus includes texts in Dutch, English, French, German, Polish, and Spanish. It is divided into two sections: (i) comparable, including texts on the same topic extracted from Wikipedia; and (ii) parallel, including texts extracted from the JRC-Acquis corpus. In both cases, documents in the six languages are included (whether parallel or just on the same topic). The objective is to consider two of the most common cross-language plagiarism detection tasks: detection of exact translations and detection of related documents.

    *Credits*
    The corpus was created at the Natural Language Engineering Lab at the Universidad Politécnica de Valencia and the Web Technology and Information Systems Group at Bauhaus-Universität Weimar.

    For citation please use the following paper (to be published):
    Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso.
    Cross-Language Plagiarism Detection. Language Resources and Evaluation. Special Issue
    on Plagiarism and Authorship Analysis.

    *Contact*

    For contact and other details, refer to:
    NLEL: http://users.dsic.upv.es/grupos/nle/
    Webis: http://www.webis.de
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    CLIPv2.1
    Resource type
    Lexicon
    Size
    6 MByte
    Languages
    Portuguese (por)
    Production status
    Newly created-in progress
    Resource usage
    Information Extraction, Information Retrieval
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    CLIPv2.1 is a fuzzy thesaurus for Portuguese discovered from several open lexical resources for this language.
    Words have fuzzy memberships in synsets, which reflect our confidence in their use to convey the synset meaning.
    CLIPv2.1 is also the first step towards the creation of a fuzzy Portuguese wordnet, where the attachment of semantic relations to synsets will also be fuzzy.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    CORDIS
    Resource type
    Corpus
    Size
    200 MByte
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Named Entity Recognition
    License
    CC0 1.0 Public Domain Dedication
    Conditions of use
    <Not Specified>
    Description
    CORDIS is the European Commission’s core public repository providing dissemination information for all EU-funded research projects. This dataset contains RDF conversion of the CORDIS FP7 dataset. It provides information about projects funded by the Europe
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Collected Movie Reviews in Serbian
    Resource type
    Corpus
    Size
    4725 movie reviews
    Languages
    Serbian (srp)
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    This is a collection of 4725 movie reviews in Serbian. The reviews are presented as individual UTF-8-encoded TXT files sorted into folders according to their source website and their score. File names are mostly given according to movie titles (and sometimes the release year). However, for three of the source websites (happynovisad.com, kakavfilm.com, popboks.com) a different naming scheme was used, based on a unique ID number for each movie review.

    Data sources
    Reviews were gathered from the following eight websites:
    * 2kokice.com
    * filmskerecenzije.com
    * filmskihitovi.blogspot.com
    * happynovisad.com
    * kakavfilm.com
    * mislitemojomglavom.blogspot.com
    * popboks.com
    * yc.rs
    This corpus contains the reviews published on these websites from their inception until 01/01/2015. Some of the reviews are written in Cyrillic, but most of them are in the Latin script.

    Scoring systems
    A 1-10 scoring system was adopted as the standard, since most websites use it. A 1-5 scoring system, used on happynovisad.com and yc.rs, was translated to 1-10 by multiplying the original scores by two. For these websites a plus/minus next to the original score was treated as an increment/decrement of the translated score. Pluses/minuses in the 1-10 scoring systems were ignored and X.5 scores were rounded down to X. In a few rare instances where a zero score was given, it was translated into a score of one.

    Score distribution
    This dataset is skewed towards the positive reviews - if scores 1-4 are treated as negative, 5-6 as neutral, and 7-10 as positive, the corpus contains 841 negative, 1278 neutral, and 2606 positive reviews.

    Subsets
    The SerbMR-2C/SerbMR-3C datasets are two-class/three-class balanced subsets of this collection.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Concepticon
    Resource type
    Corpus
    Size
    71.5 MByte
    Languages
    English (eng) French (fra) Russian (rus) Chinese (zho) German (deu)
    Production status
    Newly created-finished
    Resource usage
    Corpus Creation/Annotation
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    We present an attempt to link the large number of different concept lists (aka “Swadesh lists”) which are used in the linguistic literature. This resource, the Concepticon (http://concepticon.clld.org), links 20,077 concept labels from 100 concept lists to 2,435 concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it gives researchers quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    CoreSC/ART corpus
    Resource type
    Corpus
    Size
    27 MByte
    Languages
    English (eng)
    Production status
    Existing-used
    Resource usage
    Discourse
    License
    Creative Commons
    Conditions of use
    <Not Specified>
    Description
    The CoreSC corpus consists of 265 full papers annotated at the sentence level with a three-layer annotation scheme for denoting the scientific discourse. The first layer consists of 11 core scientific concepts (Hypothesis, Method, Background, etc.). More can be found in the corpus documentation.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Corpus LSFB
    Resource type
    Corpus
    Size
    125 hours
    Languages
    French Belgian Sign Language, French
    Production status
    Newly created-finished
    Resource usage
    Discourse
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    The Corpus LSFB is the first large-scale, online and open access corpus of French Belgian Sign Language. It includes the videos of the productions of 100 signers from Brussels and Wallonia (from age 18 to 95), the annotation of 10 hours out of the whole collection, the translation into French of 5 hours, the metadata and a lexical database.
    The Corpus LSFB will be publicly available on December 15th 2015.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Corpus of Interactional Data (CID)
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    French (fra)
    Production status
    Existing-used
    Resource usage
    <Not Specified>
    License
    SLDR
    Conditions of use
    <Not Specified>
    Description
    Corpus of Interactional Data (CID) is an audio-video recording of 8 hours of spontaneous French dialogs (1 hour of recording per session), and is publicly available.
    Each dialog involves two participants of the same gender, who know each other and are of roughly similar age.
    One of the following two topics of conversation was suggested to participants: conflicts in their professional environment or unusual situations in which participants may have found themselves. These instructions have been specially selected to elicit humor in the dialogs. However, they were not exhaustive and participants often spoke very freely about other topics, in a conversational speaking style.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Czech Legal Text Treebank 1.0
    Resource type
    Treebank
    Size
    1133 sentences
    Languages
    Czech (ces)
    Production status
    Newly created-finished
    Resource usage
    Corpus Creation/Annotation
    License
    Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    We introduce a new member of the family of Prague dependency treebanks. The Czech Legal Text Treebank 1.0 (CLTT) is a morphologically and syntactically annotated corpus of 1,133 sentences. The CLTT contains texts from the legal domain, namely documents from the Collection of Laws of the Czech Republic. Legal texts differ from other domains in several language phenomena; in particular, very long sentences occur often.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Datasets for modernising historical Basque words
    Resource type
    Lexicon
    Size
    21 KByte
    Languages
    Basque (eus)
    Production status
    Newly created-finished
    Resource usage
    Morphological Analysis
    License
    CreativeCommons (CC-BY)
    Conditions of use
    Attribution
    Description
    The historical dataset:

    - train-gero training word-form lexicon for Gero book (only non-standard words)
    - train-gero-std training word-form lexicon for Gero book (with standard words, half of them)
    - test-gero test word-form lexicon for Gero book (only non-standard words)

    The lexicons have been automatically extracted from this corpus:
    http://klasikoak.armiarma.eus/idazlanak/A/AxularGero.htm
    which was preprocessed and cleaned.
    Lexicons contain hand-validated entries. They are encoded as
    tab-separated UTF-8 files with two columns:

    1) wform: the wordform as it appears in the corpus, but lowercased

    2) nform: the corresponding normalised wordform
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Dispute mediation corpus (DMC)
    Resource type
    Corpus
    Size
    17304 words
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Discourse
    License
    open source
    Conditions of use
    <Not Specified>
    Description
    The Dispute Mediation Corpus (DMC) was created as part of a research project on argumentation, available at arg.tech/DMC. It comprises 111 annotated mediation excerpts. The annotations, carried out using the Inference Anchoring Theory, elicit the dialogical and argumentative structures of the interactions. The analyses, in the Argument Interchange Format, are stored in the AIFdb database, making argument analyses available and exchangeable through a large variety of computational tools.
    The DMC has been analyzed using the Online Visualization of Arguments tool OVA+ and is stored in the AIFdb Corpora platform. The corpus is therefore publicly available and exchangeable. The annotation scheme, based on the IAT framework, is a set of labels that allows for showing dialogical and argumentative interrelations: 470 different argumentative moves have been analyzed, of which 366 are inferential constructions and 145 are conflicting relations.
    The DMC is divided into six sub-corpora, according to the focus of the analyses or the source of the excerpts (e.g. excerpts taken from academic publications, transcripts of real and mock mediations).
    The DMC is openly available at arg.tech/DMC, where both the original text of the dialogues and the annotations can be consulted, shared and exploited by everyone.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    ELMD
    Resource type
    Corpus
    Size
    26 MByte
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Named Entity Recognition
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    ELMD is a corpus of annotated named entities from the music domain. It contains 47,254 sentences with 92,930 annotated and classified musical entities (64,873 artists, 16,302 albums, 8,275 songs and 3,480 record labels). From this set of entities, 59,680 are linked to DBpedia (46,337 artists, 7,872 albums, 3,302 songs, 2,169 record labels), with a precision of at least 0.94.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    EVALution 1.0
    Resource type
    Evaluation Data
    Size
    7500 entries
    Languages
    English
    Production status
    Newly created-in progress
    Resource usage
    Evaluation/Validation
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    Dataset containing Semantic Relations and Metadata, for Training and Evaluating Distributional Semantic Models

    Where to Find It

    Please explore the branches. Every version will be saved in a new branch, accordingly named.

    Version 1.0: https://github.com/esantus/EVALution/tree/EVALution_v1.0
    Cite Us

    The resource is freely available. If you use it, please cite the description paper:

    Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: An Evolving Semantic Dataset for Training and Evaluation of Distributional Semantic Models. Proceedings of the 4th Workshop on Linked Data in Linguistics, Beijing.
    Abstract

    In this paper, we introduce EVALution 1.0, a dataset designed for the training and the evaluation of Distributional Semantic Models (DSMs). This version consists of almost 7.5K tuples, instantiating several semantic relations between word pairs (including hypernymy, synonymy, antonymy, meronymy). The dataset is enriched with a large amount of additional information (i.e. relation domain, word frequency, word POS, word semantic field, etc.) that can be used for either filtering the pairs or performing an in-depth analysis of the results. The tuples are initially extracted from a combination of ConceptNet 5.0 and WordNet 4.0, and subsequently filtered through automatic methods and crowdsourcing in order to ensure their quality. The dataset is freely downloadable. An extension in RDF format, including also scripts for data processing, is under consideration.

    Format of the Dataset

    EVALution 1.0 is composed of two files: (1) RELATA.txt provides all the information about the relata (terms and multiword expressions); and (2) RELATIONS.txt provides all the information about the relations. Both of them are structured in a column format, where subfields are separated by either commas or slashes.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Entity salience corpus
    Resource type
    Corpus
    Size
    36200 triples
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Corpus Creation/Annotation
    License
    Creative Commons BY-NC-SA
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    An entity salience corpus providing salience annotations for entities occurring in texts. The corpus contains 128 economic news articles and 4,429 entity mentions. Salience information for each entity was obtained by crowdsourcing on the CrowdFlower platform, and each entity is classified into one of three classes: most salient, less salient, not salient. The corpus is based on the Reuters-128 entity linking corpus and is published in the NLP Interchange Format (NIF). http://ner.vse.cz/datasets/entitysalience-collection/
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Europarl Corpus of Native, Non-native and Translated Texts
    Resource type
    Corpus
    Size
    230 MByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Document Classification, Text categorisation
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker's country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    European Parliament Proceedings Parallel Corpus v6 English-French-Spanish
    Resource type
    Corpus
    Size
    470 MByte
    Languages
    English (eng) French (fra) Spanish (spa)
    Production status
    Existing-used
    Resource usage
    Corpus Creation/Annotation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    A parallel corpus extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh). The main intended use is to aid statistical machine translation research.

    More information can be found at http://www.statmt.org/europarl/.

    Terms of Use

    We are not aware of any copyright restrictions of the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Finance sentiment lexicon
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    A list of sentiment words automatically created for each company that we tested on, as well as an oil-sector-specific list. Each company and sector has two sets of lists spanning three different time periods with three different emotions: positive, negative and neutral. There are two sets because one was derived from all news articles about the companies and sector, and the other only from news articles in the business section.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    German Lexemes Annotated with Senses and Substitutions (GLASS)
    Resource type
    Corpus
    Size
    2037 entries
    Languages
    German (deu)
    Production status
    Newly created-finished
    Resource usage
    Word Sense Disambiguation
    License
    Creative Commons Attribution-ShareAlike 3.0
    Conditions of use
    Attribution, ShareAlike
    Description
    German Lexemes Annotated with Senses and Substitutions (GLASS) is a German-language data set annotated with lexical substitutions and word senses from GermaNet 9.0. GLASS is an extended and corrected version of the lexical substitution data set used at the GermEval 2015: LexSub shared task at GSCL 2015.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Hungarian Verb Clusters
    Resource type
    Lexicon
    Size
    375 KByte
    Languages
    Hungarian (hun)
    Production status
    Newly created-finished
    Resource usage
    Lexicon Creation/Annotation
    License
    LGPL/CC
    Conditions of use
    <Not Specified>
    Description
    Results of spectral clustering of Hungarian verbs according to their obligatory and optional arguments, based on different raw sentence files and different numbers of clusters (k=128, 256, 512, 1024).
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    IITP:ReviewSentimentDataset
    Resource type
    Corpus
    Size
    5417 sentences
    Languages
    Hindi
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    The focus of our work is to provide a benchmark setup for creating datasets for Aspect Based Sentiment Analysis (ABSA) in Hindi, and for developing models for aspect term extraction and sentiment analysis that make effective use of these datasets. The corpus contains more than 5000 user-written review sentences collected from various websites.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Index Thomisticus Treebank
    Resource type
    Treebank
    Size
    16000 sentences
    Languages
    Latin (lat)
    Production status
    Existing-updated
    Resource usage
    Lexicon Creation/Annotation
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
    Conditions of use
    <Not Specified>
    Description
    Dependency treebank of works of Thomas Aquinas (Medieval Latin).
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Lexical Resources for English-Malayalam
    Resource type
    Language Resources/Technologies Infrastructure
    Size
    384613 parallel words
    Languages
    English (eng) Malayalam (mal)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Creative Commons Attribution-NonCommercial 4.0 International License. The details of the license can be found at http://creativecommons.org/licenses/by-nc/4.0/
    Conditions of use
    Attribution, Non-Commercial
    Description
    Lexical Resource for English-Malayalam (c) by Sreelekha. S (www.cse.iitb.ac.in/~sreelekha/)

    The corpus contains lexical resource pairs such as phrases, verbs, lexical words etc. for English-Malayalam with 384613 entries. These pairs were obtained by programmatically extracting phrases from the corpus as well as by asking workers to translate English words into Malayalam. We have also used lexical parallel words from the Olam English-Malayalam dataset and validated them manually.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    MWE-aware English Dependency Corpus
    Resource type
    Corpus
    Size
    37000 sentences
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Parsing and Tagging
    License
    Currently confirming (because our language resource is based on Ontonotes release-5.0 (LDC2013T19), we are currently confirming the appropriate license for our resource with LDC)
    Conditions of use
    <Not Specified>
    Description
    We provide users with an English dependency corpus taking into account compound function words, which are one type of multiword expressions (MWEs) and serve as functional expressions.

    Our language resource consists of the following:
    (1) MWE-aware Dependency in CoNLL format for Wall Street Journal in Ontonotes (Stanford Dependency)
    (2) Patch for phrase structure trees in Ontonotes
    After applying this patch to Ontonotes, each MWE will be grouped under a single subtree in the phrase structure.

    For more details, please refer to the following:
    https://github.com/naist-cl-parsing/mwe-aware-dependency
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    MetaMorpho-VerbNet Semantic Restrictions Mapping Ontology
    Resource type
    Ontology
    Size
    91 concepts
    Languages
    English (eng) Hungarian (hun)
    Production status
    Newly created-finished
    Resource usage
    Semantic Role Labeling
    License
    Attribution-ShareAlike 4.0 International
    Conditions of use
    Attribution, ShareAlike
    Description
    This ontology contains a mapping between the selectional restrictions used in the Hungarian-English verb frame database of the MetaMorpho MT system and the VerbNet class-based English verb lexicon.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Multi-prototype Chinese Character Embedding
    Resource type
    Lexicon
    Size
    90 MByte
    Languages
    Chinese (zho)
    Production status
    Newly created-finished
    Resource usage
    word embedding
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Multi-prototype Chinese character embeddings contain the character and the context embeddings
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Multilingual corpus of sense-annotated definitions
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    English, Spanish, French, Portuguese, German, Swedish, Arabic, Lezgian, Tetum, Pashto, Western Panjabi, Gilaki, Min Dong, Saterland Frisian, Meadow Mari, Lojban, Pontic, Cebuano, Abkhazian, Quechua, Afrikaans, Central Bicolano, Akan, Erzya, Amharic, Aragonese, Aramaic, Assamese, Minangkabau, Avar, Aymara, Azerbaijani, Romansh, Kirundi, Romanian, Bashkir, Mazandarani, Russian, Belarusian, Bulgarian, Kinyarwanda, Bihari, Bislama, Egyptian Arabic, Bambara, Bengali, Tibetan, Sanskrit, Breton, Sardinian, Bosnian, Sindhi, Northern Sami, Sango, Buginese, Sinhalese, Slovak, Slovenian, Samoan, Shona, West Flemish, Somali, Albanian, Catalan, Serbian, Swati, Sesotho, Chechen, Sundanese, Asturian, Rusyn, Acehnese, Swahili, Chamorro, Corsican, Ligurian, Tamil, Zazaki, Gothic, Cree, Czech, Tuvan, Telugu, Old Church Slavonic, Chuvash, Tajik, Hakka, Thai, Tigrinya, Welsh, Zeelandic, Pangasinan, Turkmen, Tagalog, Simple, Tswana, Tongan, Danish, Kapampangan, Turkish, Papiamentu, Tsonga, Hawaiian, Tatar, Kalmyk, Twi, Tahitian, Cherokee, Uyghur, Divehi, Cheyenne, Ukrainian, Upper Sorbian, Dzongkha, Urdu, Ewe, Uzbek, Emilian-Romagnol, Greek, English, Esperanto, Gagauz, Estonian, Venda, Basque, Picard, Vietnamese, Novial, Gan, Volapük, Persian, Buryat (Russia), Fula, Finnish, Fijian, Silesian, Waray-Waray, Faroese, Walloon, Pennsylvania German, West Frisian, Wolof, Irish, Sorani, Scottish Gaelic, Galician, Guarani, Gujarati, Lombard, Manx, Xhosa, Nahuatl, Hausa, Neapolitan, Hebrew, Hindi, Franco-Provençal/Arpitan, North Frisian, Banjar, Norman, Croatian, Haitian, Hungarian, Yiddish, Armenian, Yoruba, Palatinate German, Interlingua, Indonesian, Interlingue, Igbo, Ido, Zhuang, Icelandic, Italian, Northern Sotho, Inuktitut, Chinese, Wu, Japanese, Zulu, Hill Mari, Javanese, Tok Pisin, Georgian, Low Saxon, Kongo, Kikuyu, Kazakh, Greenlandic, Friulian, Khmer, Kannada, Fiji Hindi, Korean, Komi-Permyak, Kashmiri, Kurdish, Komi, Cornish, Kirghiz, Ladino, Norfolk, Latin, Venetian, Luxembourgish, Luganda, Limburgish, Romani, Vepsian, Newar / Nepal Bhasa, Lingala, Lao, Lithuanian, Alemannic, Latvian, Ilokano, Lak, Moksha, Lower Sorbian, Malagasy, Maori, Macedonian, Malayalam, Udmurt, Mongolian, Moldovan, Marathi, Malay, Maltese, Karakalpak, Burmese, Kabyle, Nauruan, Nepali, Crimean Tatar, Anglo-Saxon, Sakha, Karachay-Balkar, Dutch, Latgalian, Norwegian (Nynorsk), Norwegian (Bokmål), Navajo, Bishnupriya Manipuri, Chichewa, Kabardian Circassian, Sranan, Kashubian, Occitan, Oromo, Ripuarian, Mirandese, Oriya, Ossetian, Tumbuka, Mingrelian, Punjabi, Bavarian, Piedmontese, Pali, Polish, Scots, Sicilian, Extremaduran
    Production status
    Newly created-in progress
    Resource usage
    Word Sense Disambiguation, Information Extraction, Taxonomy and Ontology Learning, Semantic Similarity, Terminology
    License
    <Not Specified>
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    A large-scale multilingual corpus of automatically disambiguated glosses drawn from different resources integrated in BabelNet (such as Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata). Sense annotations for both concepts and named entities are provided. In total, over 40 million definitions have been disambiguated for 264 languages.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    ORCID
    Resource type
    Corpus
    Size
    13 GByte
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Named Entity Recognition
    License
    CC0 1.0 Public Domain Dedication
    Conditions of use
    <Not Specified>
    Description
    ORCID (Open Researcher and Contributor ID) is a nonproprietary alphanumeric code to uniquely identify scientific and other academic authors. This dataset contains RDF conversion of the ORCID dataset. The current conversion is based on the 2014 ORCID data dump, which contains around 1.3 million JSON files amounting to 41GB of data.
    The converted RDF version is 13GB in size (uncompressed). It is modelled with well-known vocabularies such as Dublin Core, FOAF and schema.org, and it is interlinked with GeoNames.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    OSMAN UN Corpus
    Resource type
    Corpus
    Size
    100.10 MByte
    Languages
    Arabic (ara) English (eng)
    Production status
    Existing-updated
    Resource usage
    Corpus Creation/Annotation
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    73,000 parallel English and Arabic paragraphs from the United Nations (UN) corpus, a collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The Arabic text published by the UN is written without any diacritics. We used Mishkal to add diacritics to the Arabic text.
    Each language has around 3 million words from more than 2,000 documents with each document containing 36 paragraphs on average. The parallel documents are also available to download through the resource link.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    OSSMETER Threaded Corpus
    Resource type
    Corpus
    Size
    3.44 MByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Document Classification, Text categorisation
    License
    Apache License 2.0
    Conditions of use
    <Not Specified>
    Description
    The OSSMETER Threaded Corpus consists of 345 threads (1369 messages) from the Bugzilla server for Eclipse (bugs.eclipse.org) and 820 threads (2004 messages) from the Eclipse forums (news.eclipse.org). Threads were chosen randomly, irrespective of the Bugzilla product or component, or the newsgroup they belong to. Each message has been annotated by 4 or more annotators with 1 or more classes from the hierarchy presented in the associated paper. The resource contains the threads, the messages, the hierarchy and the annotations in XML format.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Ontologies of Linguistic Annotation (OLiA ontologies)
    Resource type
    Ontology
    Size
    <Not Specified>
    Languages
    <Not Specified>
    Production status
    Existing-used
    Resource usage
    Parsing and Tagging
    License
    CC-BY
    Conditions of use
    Attribution
    Description
    The Ontologies of Linguistic Annotation (OLiA) are a repository of linguistic data categories used for corpus annotation, Natural Language Processing (NLP) tools, machine-readable dictionaries, and other linguistic resources.

    They formalize application-specific terms (e.g., an annotation scheme) as OWL2/DL ontologies, and provide a declarative linking with an application-independent Reference Model that then serves as a mediator to different community-maintained terminology repositories such as GOLD and ISOcat. In this function, they will serve as a central hub for linguistic data categories within the emerging Linguistic Linked Open Data cloud. OLiA provides 34 Annotation Models for more than 69 different languages or language stages covering morphology, morphosyntax, phrase structure syntax, dependency syntax, aspects of semantics, as well as recent extensions to discourse, information structure and anaphora annotation.

    The OLiA ontologies are currently being developed at the Applied Computational Linguistics (ACoLi) Lab at the Goethe University Frankfurt, Germany. Earlier development took place in the context of the Collaborative Research Center "Linguistic Data Structures" (SFB 441/C2) in a collaborative effort of the universities of Tübingen, Hamburg, Potsdam, and HU Berlin (2005-2008), and subsequently at the Collaborative Research Center "Information Structure" (SFB 632/D1) with participation of the University of Potsdam and the Humboldt-University Berlin (since 2007). The original goal was to document and to formalize linguistic categories for all language resources of the linguistic collaborative research centers existing at the time. Later on, different applications in corpus linguistics, natural language processing and the Semantic Web were developed.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    PAN Plagiarism Corpus PAN-PC-11
    Resource type
    Corpus
    Size
    4.95 GByte
    Languages
    English (eng) German (deu) Spanish (spa)
    Production status
    Existing-used
    Resource usage
    Corpus Creation/Annotation
    License
    <Not Specified>
    Conditions of use
    Research purposes only
    Description
    The PAN plagiarism corpus 2011 (PAN-PC-11) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

    Reference:
    Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso.
    An Evaluation Framework for Plagiarism Detection. In Proceedings of the 23rd
    International Conference on Computational Linguistics (COLING 2010), Beijing,
    China, August 2010. Association for Computational Linguistics.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    PULO (Portuguese Unified Lexical Ontology)
    Resource type
    Ontology
    Size
    17854 synsets
    Languages
    Portuguese (por)
    Production status
    Existing-updated
    Resource usage
    Lexicon Creation/Annotation
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    PULO stands for Portuguese Unified Lexical Ontology.
    At its base, it aims to be a free wordnet variant for the Portuguese language, enriched with further information in a way that keeps it compatible with the wordnets of other languages.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Package for Consistent Evaluation of Morphological Segmentation
    Resource type
    Evaluation Package
    Size
    102 KByte
    Languages
    Finnish (fin) Russian (rus) Turkish (tur)
    Production status
    Newly created-finished
    Resource usage
    Evaluation/Validation
    License
    Gnu
    Conditions of use
    <Not Specified>
    Description
    The package contains the code for the proposed new evaluation method as well as gold standard annotations for Finnish. Similar annotations for Russian and Turkish are available and will be uploaded with the final version.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Partial coreference resolution tool for improving NER
    Resource type
    Evaluation Package
    Size
    5.3 MByte
    Languages
    Spanish Portuguese Galician English
    Production status
    Newly created-in progress
    Resource usage
    Named Entity Recognition
    License
    GPLv3
    Conditions of use
    <Not Specified>
    Description
    Coreference resolution (CR) tool with NER correction heuristics and Spanish data for replicability.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Persian Universal Dependency Treebank (Persian UD)
    Resource type
    Treebank
    Size
    151,671 tokens
    Languages
    Persian (contemporary)
    Production status
    Newly created-in progress
    Resource usage
    Parsing and Tagging
    License
    Creative Commons Attribution-ShareAlike 4.0 International license
    Conditions of use
    Attribution, ShareAlike
    Description
    Persian Universal Dependency Treebank
    ----------------------------------
    The Persian Universal Dependency Treebank (Persian UD) is the converted version of the Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015). The treebank has its original annotation scheme based on Stanford Typed Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008). The scheme was extended for Persian to include the language-specific syntactic relations that could not be covered by the primary scheme developed for English. The treebank consists of 6000 sentences of written text with large domain variations, in terms of different genres (containing newspaper articles, fiction, technical descriptions, and documents about culture and art) and tokenization. The variations in the tokenization are due to the orthographic variations of compound words and fixed expressions in the language. The original UPDT was developed by Mojgan Seraji, under the supervision of Joakim Nivre and Carina Jahani at Uppsala University.

    #STATISTICAL OVERVIEW OF THE PERSIAN UD:
    --------------------------------------
    Tree count: 6000
    Word count: 152918
    Token count: 151671
    Dep. relations: 37 of which 7 language specific
    POS tags: 15
    Category=value feature pairs: 14
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Port4NooJ
    Resource type
    Lexicon
    Size
    100,000 entries
    Languages
    Portuguese (por)
    Production status
    Existing-updated
    Resource usage
    Textual Entailment and Paraphrasing
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    Port4NooJ is the Portuguese language module of NooJ, whose initial linguistic resources derive from OpenLogos. OpenLogos is an open source derivative of the commercial Logos system downloadable from the DFKI website (logos-os.dfki.de), and available at INESC-ID (www.l2f.inesc-id.pt/openlogos/demo.html). The Logos system was built on the Logos Model (Scott, 2003; Barreiro, 2011). In order to create Port4NooJ, the OpenLogos English-Portuguese dictionary was converted into NooJ format and enhanced with new properties, including derivational, morpho-syntactic and semantic relations that allowed generation of paraphrases for Portuguese (Barreiro, 2009), now feeding the linguistic engine of the eSPERTo paraphrasing system (www.esperto.l2f.inesc-id.pt) based on NooJ technology (Silberztein, 2015). Port4NooJ includes linguistic resources such as: (i) a large coverage dictionary with English transfers; (ii) rules to formalize and document Portuguese inflectional and derivational descriptions; and (iii) local grammars, namely morphological, disambiguation, semantico-syntactic, multiword expression, and translation grammars. The different components of Port4NooJ interact with one another and are used to process texts. Several processing functions can be performed with these resources, among others part of speech annotation, semantic analysis, named entity recognition, translation and paraphrasing. The module can be downloaded from the NooJ website (www.nooj-association.org). Port4NooJ v3.0 contains: (i) lexicon grammar properties of human intransitive adjectives for paraphrasing, (ii) a polarity lexicon for sentiment analysis, and (iii) a lexicon of named entities for named-entity recognition.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Possession identification in blogs
    Resource type
    Corpus
    Size
    1 MByte
    Languages
    English
    Production status
    Newly created-in progress
    Resource usage
    Possession identification
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    This is a dataset comprising annotations for possessions in the blog genre. There are three different files, based on how many of the three annotators who participated in the study agreed on a particular possession being marked. The most stringent dataset is 3+, as it requires all annotators to have agreed, while the most lax is 1+, meaning that at least one annotator has marked the possession.
    Download from
    Referring paper
    Edition
    LREC 2016
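The agreement-based files described in the entry above can be pictured with a small sketch. This is not the authors' code: the identifiers `p1` to `p4`, the function name `agreement_subsets`, and the intermediate `2+` level are illustrative assumptions (the description names only the 1+ and 3+ files explicitly).

```python
from collections import Counter

def agreement_subsets(annotations, thresholds=(1, 2, 3)):
    """Map '1+', '2+', '3+' to the items marked by at least that many annotators."""
    # set(marked) guards against one annotator listing the same item twice.
    counts = Counter(item for marked in annotations for item in set(marked))
    return {
        f"{k}+": {item for item, n in counts.items() if n >= k}
        for k in thresholds
    }

# Hypothetical possession identifiers marked by three annotators.
annotator_marks = [
    {"p1", "p2", "p3"},
    {"p1", "p2"},
    {"p1", "p4"},
]
subsets = agreement_subsets(annotator_marks)
```

Under this reading, each stricter subset is contained in the laxer ones, which matches the description of 3+ as the most stringent file and 1+ as the most lax.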
  • Name
    PotTS
    Resource type
    Corpus
    Size
    17 MByte
    Languages
    German (deu)
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    MIT
    Conditions of use
    <Not Specified>
    Description
    A novel comprehensive dataset of 7,992 German tweets which have been manually annotated with fine-grained sentiment relations. The annotation scheme used for this corpus includes such sentiment-relevant elements as opinion spans, their respective sources and targets, emotionally laden expressions and words with their possible negations and modifiers. An extended set of additional attributes associated with these elements also precisely reflects compositional changes in sentiments' polarities and intensities.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Predicate Matrix
    Resource type
    Lexicon
    Size
    406103 entries
    Languages
    English (eng) Spanish (spa) Catalan (cat) Basque (eus)
    Production status
    Existing-updated
    Resource usage
    Semantic Role Labeling
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Following the line of WordFrameNet, the Predicate Matrix is a new lexical resource resulting from the integration of multiple sources of predicate information including FrameNet, VerbNet, PropBank and WordNet.

    We start from the basis of SemLink. However, SemLink coverage is still far from complete. We apply a variety of automatic methods to extend its current coverage. Moreover, by using the Predicate Matrix, we expect to provide a more robust interoperable lexicon by discovering and solving inherent inconsistencies among the resources. We also plan to enrich WordNet with predicate information, and possibly to extend predicate information to languages other than English by exploiting the local wordnets integrated into the Multilingual Central Repository (MCR).
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Preliminary Earning Announcements
    Resource type
    Corpus
    Size
    500 KByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Corpus Creation/Annotation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    500 Preliminary Earning Announcements (PEAs) released between 2010 and 2012 by 140 firms listed on the London Stock Exchange. The documents have been split into 3500 sentences and manually annotated by experts in accounting and finance to identify sentence tone, attribution and attribution tone.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    RTE+SR 1-7
    Resource type
    Corpus
    Size
    5 MByte
    Languages
    English
    Production status
    Newly created-in progress
    Resource usage
    Textual Entailment and Paraphrasing
    License
    CreativeCommons
    Conditions of use
    Attribution, Non-Commercial
    Description
    Semantic relatedness score annotations for the RTE1-7 corpora.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Regulation Room Divisiveness
    Resource type
    Corpus
    Size
    15,398 words
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Argument Mining
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    The corpus comprises user comments extracted from an eRulemaking platform, RegulationRoom.org. Rulemaking is a multi-step process that federal agencies use to develop new regulations on health and safety, finance, and other complex topics. This dataset consists of user comments about the Airline Passenger Rights (APR) rule proposed by the Department of Transportation (corresponding rule in the US Federal Register: USDOT (2010b) Enhancing Airline Passenger Protections, 75 FR 32318). Dialogical and discursive relations were annotated using the Online Visualization of Arguments tool, OVA+ (ova.arg-tech.org), marking support and conflict relations between propositions. The corpus consists of 977 assertions, 305 Support instances and 47 Conflict instances.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    SCATE Corpus
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Information Extraction, Information Retrieval
    License
    ODbL/DbCL
    Conditions of use
    <Not Specified>
    Description
    The Semantically Compositional Annotation of Time Expressions (SCATE) corpus contains documents that have been annotated for time concepts with a fine-grained annotation scheme that formally defines the semantics of each annotation in terms of mathematical operations over intervals on the timeline.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    SemDaX
    Resource type
    Corpus
    Size
    80,000 words
    Languages
    Danish
    Production status
    Newly created-finished
    Resource usage
    Word Sense Disambiguation
    License
    CLARIN Academic license
    Conditions of use
    Educational, teaching or research purposes only
    Description
    SemDaX is a Danish sense-annotated corpus of approx. 80,000 words. SemDaX contains six domains and comprises two subcorpora: SemDaX-Coarse (all words) and SemDaX-LexicalSample. SemDaX-Coarse comprises 3,865 sentences summing up to 76,479 running words, of which 32,871 words (all nouns, verbs and adjectives) have been annotated with supersenses. 60% of these have been doubly annotated and curated. In SemDaX-LexicalSample the number of annotated sentences for each selected noun varies according to the number of senses of the noun (100 + 15*no. of senses), thus spanning from 177 to 535 sentences per noun. All these sentences have been doubly annotated.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    SemRelData
    Resource type
    Ontology
    Size
    60,000 tokens
    Languages
    English (eng) German (deu) Russian (rus)
    Production status
    Newly created-finished
    Resource usage
    Lexicon Creation/Annotation
    License
    Creative Commons
    Conditions of use
    Attribution
    Description
    SemRelData (Semantic Relation Dataset) is a dataset focused on contextual annotation of classical semantic relations between nominals in various genres, i.e. encyclopaedic, literary and news texts, and different languages, here: English, German and Russian. The dataset is under CC-BY license.

    It consists of texts extracted from three different genres: encyclopaedic texts, extracted from Wikipedia; newspaper articles, extracted from Wikinews; and out-of-copyright literary texts.

    The resulting dataset consists of 13 news articles, 20 encyclopaedic articles, and snippets from 9 literary texts, all available in parallel in the three described languages, and contains approximately 60,000 tokens, 15,000 noun compounds, 3,400 annotated relations and 9,400 transitive relations.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    SerbMR-2C
    Resource type
    Corpus
    Size
    1682 movie reviews
    Languages
    Serbian (srp)
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    This two-class balanced sentiment analysis dataset contains 1682 movie reviews in Serbian (841 positive and 841 negative reviews). SerbMR-2C is a subset of the imbalanced "Collected Movie Reviews in Serbian" dataset and was constructed by including all 841 negative reviews from it and choosing 841 positive reviews in the manner described in our LREC 2016 paper. This balancing procedure minimizes the non-sentiment-related differences between the classes and takes into account review scores, review lengths and the differences in writing styles used on different source websites. SerbMR-2C is also a subset of the SerbMR-3C dataset as it contains the positive/negative reviews from that dataset, but not the neutral ones.

    The dataset is formatted as a Weka .arff file, with one review per line. All Cyrillic reviews were converted into the Latin script.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    SerbMR-3C
    Resource type
    Corpus
    Size
    2523 movie reviews
    Languages
    Serbian (srp)
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    This three-class balanced sentiment analysis dataset contains 2523 movie reviews in Serbian (841 positive, 841 neutral, and 841 negative reviews). SerbMR-3C is a subset of the imbalanced "Collected Movie Reviews in Serbian" dataset and was constructed by including all 841 negative reviews from it and choosing 841 positive and 841 neutral reviews in the manner described in our LREC 2016 paper. This balancing procedure minimizes the non-sentiment-related differences between the classes and takes into account review scores, review lengths and the differences in writing styles used on different source websites. SerbMR-3C is also a superset of the SerbMR-2C dataset as it contains all of the positive/negative reviews from that dataset, as well as the additional 841 neutral reviews.

    The dataset is formatted as a Weka .arff file, with one review per line. All Cyrillic reviews were converted into the Latin script.
    Download from
    Referring paper
    Edition
    LREC 2016
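Both SerbMR entries above state that the data ships as a Weka .arff file with one review per line. As a minimal sketch of how such a file might be read without Weka, the snippet below parses a simple dense ARFF stream with the Python standard library. The sample header, the attribute names `text` and `class`, the label values, and the two sample reviews are assumptions made for illustration; the real files may use different names and quoting.

```python
import csv
import io

# Hypothetical two-review sample; attribute names and class labels are
# illustrative assumptions, not the real SerbMR header.
SAMPLE_ARFF = """\
@relation reviews
@attribute text string
@attribute class {POSITIVE,NEGATIVE}
@data
'Odlican film, topla preporuka.',POSITIVE
'Dosadno i predugo.',NEGATIVE
"""

def read_arff(stream):
    """Return (attribute_names, rows) from a simple dense ARFF stream."""
    names, rows, in_data = [], [], False
    for line in stream:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # Assumes unquoted attribute names without spaces.
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        # One instance per line: comma-separated values, strings single-quoted.
        values = next(csv.reader([line], quotechar="'", escapechar="\\"))
        rows.append(dict(zip(names, values)))
    return names, rows

names, rows = read_arff(io.StringIO(SAMPLE_ARFF))
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, where the path and encoding are again assumptions to check against the downloaded data. Sparse ARFF instances and quoted attribute names are not handled by this sketch.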
  • Name
    Slovak Categorized News Corpus 2
    Resource type
    Corpus
    Size
    1 194 084 tokens
    Languages
    Slovak English
    Production status
    Newly created-finished
    Resource usage
    Information Extraction, Information Retrieval
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    This work proposes an information retrieval evaluation set for the Slovak language. A set of 80 queries written in natural language is given together with the set of relevant documents. The document set contains 3980 newspaper articles sorted into 6 categories. Each document is manually annotated for relevancy to its corresponding query. The evaluation set is compatible with the Cranfield test collection, using the same methodology for queries and annotation of relevancy. In addition, it provides annotation for document title, author, publication date and category that can be used for evaluation of automatic document clustering and categorization. Each query in the Slovak language is translated into English, making the set usable for evaluating cross-lingual information retrieval.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Spoken Wikipedia
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    German (deu) English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Speech Recognition/Understanding
    License
    CC-BY-SA
    Conditions of use
    Attribution, ShareAlike
    Description
    time-aligned Spoken Wikipedia data
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Story Dialogue with Gestures (SDG) Corpus
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    English
    Production status
    Newly created-in progress
    Resource usage
    Storytelling, Dialog Generation
    License
    None
    Conditions of use
    <Not Specified>
    Description
    This is a sample of the Story Dialogues with Gestures corpus. It contains a protest story and a storm story with full annotations.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Syntactic Reference Corpus of Medieval French (SRCMF)
    Resource type
    Corpus
    Size
    300000 words
    Languages
    Old French (842-ca. 1400) (fro)
    Production status
    Newly created-finished
    Resource usage
    Parsing and Tagging
    License
    CreativeCommons (for details refer to the SRCMF website)
    Conditions of use
    <Not Specified>
    Description
    The Syntactic Reference Corpus of Medieval French (SRCMF) is a manually annotated dependency treebank. It contains texts from the Base de Français Médiéval and the Nouveau Corpus d'Amsterdam. Syntactic dependencies were added and revised manually. The corpus is available in several formats (RDF graphs, XML, compiled for queries with TIGERSearch etc.)
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Syntactically Annotated Wikipedia Dump
    Resource type
    Corpus
    Size
    17 GByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Information Extraction, Information Retrieval
    License
    Gnu, CreativeCommons (same as Wikipedia)
    Conditions of use
    <Not Specified>
    Description
    A Syntactically Annotated Wikipedia Dump (SAWD) from November 2014. Contains PoS tags, dependencies and the constituent-based parse tree for each word, using the StanfordCoreNLP suite of tools.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    TPR-DB/ENJA15
    Resource type
    Corpus
    Size
    30 hours
    Languages
    English Japanese
    Production status
    Newly created-in progress
    Resource usage
    Translation Process Research
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    The CRITT Translation Process Database (TPR-DB) is a publicly available database of recorded translation sessions for Translation Process Research (TPR). It contains user activity data (UAD) of translators' behavior collected in approximately 30 translation (and text production) studies with Translog-II and with the CASMACAT workbench. This data acquisition software logs keystrokes and gaze data during text perception and text production. The data currently amounts to more than 500 hours of text production gathered in more than 1400 sessions. In addition to the raw logging data, a post-processed database (TPR-DB) is made available which compiles this data into a set of tab-separated tables that can be more easily processed by various visualization and analysis tools.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Tempo-IndoWordnet
    Resource type
    Ontology
    Size
    22.7 MByte
    Languages
    Hindi
    Production status
    Newly created-in progress
    Resource usage
    Information Extraction, Information Retrieval
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    This is a temporally annotated dataset for Hindi. Each synset of Hindi WordNet has been classified. Detailed descriptions along with ReadMe are available in the project link http://dipaweshpawar.github.io/ProjectHWTT.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    The French Question Bank
    Resource type
    Treebank
    Size
    2600 sentences
    Languages
    French
    Production status
    Newly created-finished
    Resource usage
    Evaluation + Parsing/Tagging increased coverage
    License
    Creative Commons (NC-BY-SA)
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    We present the French Question Bank, a treebank of 2600 questions. It is available in
    - constituency
    - surface dependency
    - deep dependency (not presented in this abstract)
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    The LetsRead Corpus
    Resource type
    Corpus
    Size
    2 GByte
    Languages
    Portuguese (por)
    Production status
    Newly created-finished
    Resource usage
    Speech Recognition/Understanding
    License
    CreativeCommons Attribution-ShareAlike 4.0 International
    Conditions of use
    Attribution, ShareAlike
    Description
    Speech Corpus of European Portuguese read speech by 284 children aged 6-10 years old from the 1st to 4th school grades. The data amounts to around 20h of speech with 5h30m fully annotated. Reading tasks are sentences and pseudowords. Several reading disfluencies are identified with the purpose of reading performance evaluation.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    TwitterBuonaScuola Corpus
    Resource type
    Corpus
    Size
    6,659 tweets
    Languages
    Italian
    Production status
    Newly created-finished
    Resource usage
    Opinion Mining/Sentiment Analysis
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    The TwitterBuonaScuola corpus (TW-BS) is a novel Italian linguistic resource for Sentiment Analysis, developed with the main aim of analyzing the online debate on the controversial Italian political reform "Buona Scuola" (Good school). The corpus is composed of 6,659 tweets. For each of them we provide an annotation of polarity, irony, and topic.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Universal Dependencies
    Resource type
    Treebank
    Size
    250 000 sentences
    Languages
    Basque, Bulgarian, Croatian, Czech, Danish, English, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Irish, Italian, Persian, Spanish, Swedish
    Production status
    Newly created-in progress
    Resource usage
    Parsing and Tagging
    License
    Various licenses but mostly CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    Universal Dependencies is a multilingual collection of treebanks with cross-linguistically consistent annotation. Version 1.1 consists of 19 treebanks representing 18 languages.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Uppsala Persian Dependency Treebank (UPDT)
    Resource type
    Treebank
    Size
    151,671 tokens
    Languages
    Persian (contemporary)
    Production status
    Existing-updated
    Resource usage
    Parsing and Tagging
    License
    Creative Commons Attribution 3.0 License
    Conditions of use
    Attribution
    Description
    Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015, Chapter 5, pp. 97-146) is a dependency-based syntactically annotated corpus. The treebank consists of 6000 sentences (151,671 tokens) of written text in CoNLL-format and is developed through a bootstrapping procedure involving the open source data-driven dependency parser MaltParser (Nivre et al., 2006), and manual validation of the annotation. The treebank was first released in 2013.
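
    To illustrate how a treebank in CoNLL format is typically read, here is a minimal sketch. It assumes the ten tab-separated columns of the CoNLL-X layout and blank lines between sentences; whether UPDT uses exactly this variant is an assumption, and the sample tokens are placeholders rather than treebank data.

```python
# Column names of the CoNLL-X layout (assumed to apply to this treebank).
CONLL_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines):
    """Yield sentences as lists of token dicts; blank lines end a sentence."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLL_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence  # last sentence without a trailing blank line


# Made-up two-token sentence with placeholder values.
SAMPLE = [
    "1\tw1\t_\tN\tN\t_\t2\tnsubj\t_\t_\n",
    "2\tw2\t_\tV\tV\t_\t0\troot\t_\t_\n",
    "\n",
]
sentences = list(read_conll(SAMPLE))
n_tokens = sum(len(s) for s in sentences)
```

    Counting sentences and tokens this way over the full file gives a simple check against the figures quoted above.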

    The treebank data is extracted from the open source, validated Uppsala Persian Corpus (UPC) (Seraji, 2015, Chapter 3, pp. 68-81) created from on-line material containing newspaper articles and common text on various topics (e.g. culture, technology, fiction, and art). The corpus is annotated with 31 part-of-speech tags.

    The treebank annotation scheme is based on Stanford Typed Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008). The full set of dependency relations used in the annotation, together with extensive guidelines for sentence segmentation, tokenization, and morphological annotation, is described in detail in Seraji (2015); for the treebank see Chapter 5, pp. 97-146, and for sentence segmentation, tokenization, and morphological annotation see Chapter 3, pp. 68-81.

    References:
    -------------
    # De Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).

    # De Marneffe, Marie-Catherine, and Christopher D. Manning. 2008. Stanford Typed Dependencies Representation. In Proceedings of the COLING’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation.

    # Nivre J., Hall J., and Nilsson J. 2006. Maltparser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).

    # Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Urdu Summary Corpus
    Resource type
    Corpus
    Size
    9.84 MByte
    Languages
    Urdu (urd)
    Production status
    Newly created-finished
    Resource usage
    Summarisation
    License
    GNU
    Conditions of use
    <Not Specified>
    Description
    Urdu Summary Corpus

    The Urdu Summary Corpus consists of 50 articles collected from various blogs. From the original HTML documents only the unformatted content text was kept; everything else was removed. We provide abstractive summaries of these 50 articles. After normalization, we applied different NLP tools to the articles to generate part-of-speech tagged, morphologically analyzed, lemmatized and stemmed versions.

    Urdu Summary Corpus Tools

    + Normalization is taken from [2]; diacritic marks are also removed in this step.
    + The table-lookup based morphological analyzer and lemmatizer are built from [3].
    + The stemmer is built from [1].
    + The table-lookup based POS tagger is built from [4]. We used unigram and bigram counts.

    Commands:

    Unzip USCTools.zip

    Open Console

    Go to the USCTools directory by typing: cd USCTools

    For Normalization

    $ java -cp bin USCTools normalize input.txt output.txt

    For Lemmatization

    $ java -cp bin USCTools lemmatize input.txt output.txt

    For Morphological analysis

    $ java -cp bin USCTools morph_analysis input.txt output.txt

    For stemming by Assas-Band

    $ java -cp bin USCTools stemming input.txt output.txt

    For POS tagging

    $ java -cp bin USCTools tagging input.txt output.txt


    [1] Q.-u.-A. Akram, A. Naseer, and S. Hussain. Proceedings of the 7th Workshop on Asian Language Resources (ALR7), chapter Assas-band, an Affix-Exception-List Based Urdu Stemmer, pages 40-47. Association for Computational Linguistics, 2009.

    [2] A. Gulzar. Urdu normalization utility v1.0. Technical report, Center for Language Engineering, Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering, Lahore, Pakistan. http://www.cle.org.pk/software/langproc/urdunormalization.htm, 2007.

    [3] M. Humayoun, H. Hammarström, and A. Ranta. Urdu morphology, orthography and lexicon extraction. CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA Linguistic Institute. Stanford University, California, USA., pages 21-22, 2007. http://www.lama.univ-savoie.fr/ humayoun/UrduMorph/.

    [4] B. Jawaid, A. Kamran, and O. Bojar. A tagged corpus and a tagger for Urdu. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    VerLexPor
    Resource type
    Lexicon
    Size
    7232 sentences
    Languages
    Portuguese (por)
    Production status
    Newly created-in progress
    Resource usage
    Evaluation/Validation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Walenty
    Resource type
    Lexicon
    Size
    15000 entries
    Languages
    Polish (pol)
    Production status
    Newly created-in progress
    Resource usage
    Parsing and Tagging
    License
    CreativeCommons BY
    Conditions of use
    Attribution
    Description
    Walenty, a comprehensive valence dictionary of Polish, is developed at the Institute of Computer Science. The dictionary is meant to be both human- and machine-readable; in particular, it is being employed by two parsers of Polish.

    On the syntactic level, Walenty exhibits a number of novel features compared to other such dictionaries, such as the structural case, clausal subjects, distributive po, complex prepositions, comparative constructions, and control and raising. Each lexical entry consists of a number of valence schemata, and each schema is a set of syntactic positions. The notion of a syntactic position is based on the coordination test and takes into consideration the possibility of diverse morphosyntactic realisations.

    The semantic layer is composed of semantic frames. Each frame is a set of semantic arguments, which are pairs <semantic role, selectional preferences>. Each frame is connected to a meaning of a predicate. Those meanings are identified by PolishWordNet lexical units (LUs). It is possible that multiple LUs correspond to the same frame.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    Webis-CLS-10
    Resource type
    Corpus
    Size
    12.2 MByte
    Languages
    English (eng) French (fra)
    Production status
    Existing-used
    Resource usage
    Emotion Recognition/Generation
    License
    <Not Specified>
    Conditions of use
    Attribution
    Description
    The dataset comprises about 800,000 Amazon product reviews for three product categories (books, DVDs and music) written in two languages: English and French. The French reviews were crawled from Amazon in November 2009. The English reviews were sampled from the Multi-Domain Sentiment Dataset (Blitzer et al., 2007). For more information about the construction of the dataset see Prettenhofer and Stein (2010).

    For each language-category pair there are three sets: training documents, test documents, and unlabeled documents. The training and test sets comprise 2,000 documents each, whereas the number of unlabeled documents varies from 9,000 to 170,000.

    References
    ----------
    Peter Prettenhofer, Benno Stein. Cross-Language Text Classification using Structural Correspondence Learning. Association for Computational Linguistics (ACL), 2010.
    John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    WikiParq
    Resource type
    Corpus
    Size
    90 GByte
    Languages
    French English Swedish German Spanish Russian
    Production status
    Newly created-finished
    Resource usage
    Corpus Creation/Annotation
    License
    CreativeCommons
    Conditions of use
    Attribution, ShareAlike
    Description
    WikiParq is a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries to extract specific information or subcorpora from Wikipedia, such as all the infoboxes of the articles in French, or all the first paragraphs of the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish.

    Currently, six versions of Wikipedia are available as tarball archives in the WikiParq format.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    apertium-tyv
    Resource type
    Lexicon
    Size
    11000 lexemes
    Languages
    Tuvan
    Production status
    Newly created-in progress
    Resource usage
    Morphological Analysis
    License
    GNU GPL
    Conditions of use
    <Not Specified>
    Description
    Finite-state morphological analyser for Tuvan
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    babelnet+adagram matching: evaluation dataset for 50 words
    Resource type
    Evaluation Data
    Size
    50 words
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Evaluation/Validation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Manual mapping of senses. We built an evaluation dataset for 50 ambiguous words. Some of these 50 words, such as "bank" and "plant" are commonly used in word sense disambiguation evaluations. Others, like "delphi" or "python" may refer to both nouns and named entities.

    For each of these words, we retrieved all BabelNet and AdaGram senses. We generated all 1751 possible matching combinations for these 50 words and annotated all pairs of BabelNet-AdaGram senses. As mentioned above, very often, BabelNet has much more specific senses than AdaGram. For instance, as the maximum number of senses for AdaGram is five, it would not be able to learn separate senses for "apple (tree)", "apple (fruit)" and "apple fruit as a symbol". During annotation, mapping of the AdaGram sense "cherry, avocado, fruit, peach, ..." to these three would be considered as correct, while mapping them to the "macintosh, hardware, pc, microsoft, ..." sense would be considered incorrect. The final dataset contains 241 sense alignments, i.e. 14% of all possible combinations.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    blame and praise dataset
    Resource type
    Corpus
    Size
    489 KByte
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Document Classification, Text categorisation
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    ISEAR dataset annotated for blame and praise classification.
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    fienWaC v1.0
    Resource type
    Corpus
    Size
    4309304 sentences
    Languages
    Finnish (fin) English (eng)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    hrenWaC v2.0
    Resource type
    Corpus
    Size
    2333464 sentences
    Languages
    Croatian (hrv) English (eng)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    metaTED
    Resource type
    Corpus
    Size
    16 MByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Discourse
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    metaTED is composed of 16 XML files, one for each category of metadiscourse. Each file contains 180 talks identified by the official TED talk ID, <talk id=...> </talk>. Each talk is a sequence of words carrying the attribute "selected", which can be 0, 1, 2, or 3, depending on how many workers identified that word as being part of that specific metadiscursive act. More data, including annotation statistics, can be provided upon request to rcorreia@andrew.cmu.edu
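
    The file layout described above could be read along the following lines. This is a minimal sketch under stated assumptions: only the <talk id=...> element and the "selected" attribute come from the description; the root element name and the <w> element used for words are hypothetical, and the sample text is made up.

```python
import xml.etree.ElementTree as ET

# Made-up fragment. <talks> and <w> are assumed names; only <talk id=...>
# and the "selected" attribute are given in the description.
SAMPLE = """<talks>
  <talk id="1">
    <w selected="0">so</w>
    <w selected="3">today</w>
    <w selected="2">I</w>
    <w selected="0">agree</w>
  </talk>
</talks>"""


def marked_words(xml_text, min_workers=2):
    """Map talk id -> words flagged by at least `min_workers` workers."""
    root = ET.fromstring(xml_text)
    result = {}
    for talk in root.iter("talk"):
        words = []
        # Accept any descendant carrying "selected", whatever its tag name.
        for elem in talk.iter():
            selected = elem.get("selected")
            if selected is not None and int(selected) >= min_workers:
                words.append((elem.text or "").strip())
        result[talk.get("id")] = words
    return result


print(marked_words(SAMPLE))
```

    Raising min_workers to 3 keeps only words on which all workers agreed, which is one simple way to trade coverage for annotation confidence.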
    Download from
    Referring paper
    Edition
    LREC 2016
  • Name
    typing_game_logs
    Resource type
    Language Resources/Technologies Infrastructure
    Size
    679 KByte
    Languages
    <Not Specified>
    Production status
    Existing-used
    Resource usage
    Web Services
    License
    cc-by-4.0 (http://creativecommons.org/licenses/by/4.0/)
    Conditions of use
    Attribution
    Description
    Typing_game_logs

    This data contains logs extracted from our typing game. The data is in CSV format, encoded in UTF-8 with LF line endings, and arranged in chronological order.
    The first column contains the strings that users actually typed. These strings may include an uppercase letter "B", which indicates that the user hit the Backspace key.
    The second column contains the strings that users submitted, as they saw them. These strings include neither the uppercase letter "B" nor the letters removed by the Backspace key.
    The third column contains the English word shown in our typing game.
    The fourth column contains the time (ms) that the user spent entering one word.
    The fifth column contains the (anonymous) user name in the typing game.
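
    The column layout above suggests a simple way to load the log and to replay the Backspace marker. The sketch below is illustrative only: the field names are hypothetical labels for the five columns, the sample row is made up, and it assumes that every uppercase "B" in the first column is a Backspace rather than a literal letter.

```python
import csv
import io

# Hypothetical labels for the five columns described above.
FIELDS = ["typed", "submitted", "target", "time_ms", "user"]


def apply_backspaces(typed):
    """Replay a raw keystroke string, treating uppercase "B" as Backspace."""
    out = []
    for ch in typed:
        if ch == "B":
            if out:
                out.pop()  # delete the previously typed character
        else:
            out.append(ch)
    return "".join(out)


def read_log(handle):
    """Yield one dict per well-formed row of the CSV log."""
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip rows that do not have exactly five columns
        record = dict(zip(FIELDS, row))
        record["time_ms"] = float(record["time_ms"])
        yield record


# Made-up sample row, not taken from the dataset.
sample = io.StringIO("hwBello,hello,hello,1530,user01\n")
records = list(read_log(sample))
```

    Comparing apply_backspaces(record["typed"]) with record["submitted"] is one way to check the assumption about "B" on the real data before relying on it.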

    You can use this data under the terms of the cc-by-4.0 license (http://creativecommons.org/licenses/by/4.0/).

    Typing_game_logs
    Copyright 2015 Ryuichi Tachibana and Mamoru Komachi
    Licensed under the cc-by-4.0
    Download from
    Referring paper
    Edition
    LREC 2016

Important Dates

  • 25 October 2015: Submission of proposals for oral and poster papers
  • 25 October 2015: Submission of proposals for panels, workshops and tutorials
  • 27 November 2015: Notification of acceptance of workshops and tutorials
  • 1st February 2016: Notification of accepted papers
  • 18th February 2016: Online Registration
  • 17th March 2016: Final Submission Deadline
  • 23-24 May 2016: Pre-conference Workshops & Tutorials
  • 25-26-27 May  2016: Main Conference
  • 28 May 2016: Workshops & Tutorials
