14th Annual Workshop of
The Australasian Language Technology Association

Monash University, Melbourne

5th - 7th December 2016

Please see the draft proceedings for the paper PDFs; these will soon be moved to the ACL Anthology.


Monday 5th - Monash Caulfield B2.14

9:00 Tutorial 1a: NP Bayes
10:15 Morning tea
10:45 Tutorial 1b: NP Bayes
12:15 Lunch
13:30 Tutorial 2a: Succinct data struct.
15:15 Afternoon tea
15:45 Tutorial 2b: Succinct data struct.
16:45 End of session

Tuesday 6th - Monash Caulfield B2.14

Session 1: Opening & Invited talk
9:00 Opening
9:15 Invited talk: Mark Steedman. On Distributional Semantics
10:15 Morning tea
Session 2: Translation (Chair: Stephen Wan)
10:45 Presentation: Kyo Kageura, Martin Thomas, Anthony Hartley, Masao Utiyama, Atsushi Fujita, Kikuko Tanabe and Chiho Toyoshima. Supporting Collaborative Translator Training: Online Platform, Scaffolding and NLP
11:10 Presentation: Nitika Mathur, Trevor Cohn and Timothy Baldwin. Improving Human Evaluation of Machine Translation
11:25 Paper: Cong Duy Vu Hoang, Reza Haffari and Trevor Cohn. Improving Neural Translation Models with Linguistic Factors
11:40 Presentation: Daniel Beck, Lucia Specia and Trevor Cohn. Exploring Prediction Uncertainty in Machine Translation Quality Estimation
11:55 CLEF eHealth 2017 Shared tasks
12:00 Lunch
Session 3a: Invited talk (Chair: Hanna Suominen)
13:15 Invited talk: Hercules Dalianis. HEALTH BANK: A Workbench for Data Science Applications in Healthcare
13:55 Break
Session 3b: Health (Chair: Hanna Suominen)
14:00 Presentation: Raghavendra Chalapathy, Ehsan Zare Borzeshi and Massimo Piccardi. An Investigation of Recurrent Neural Architectures for Drug Name Recognition
14:15 Paper: Hamed Hassanzadeh, Anthony Nguyen and Bevan Koopman. Evaluation of Medical Concept Annotation Systems on Clinical Records
14:30 Paper: Mahnoosh Kholghi, Lance De Vine, Laurianne Sitbon, Guido Zuccon and Anthony Nguyen. The Benefits of Word Embeddings Features for Active Learning in Clinical Information Extraction
14:45 Presentation: Rebecka Weegar and Hercules Dalianis. Mining Norwegian pathology reports: A research proposal
15:00 Paper: Pin Huang, Andrew MacKinlay and Antonio Jimeno. Syndromic Surveillance using Generic Medical Entities on Twitter
15:15 Paper: Yufei Wang, Stephen Wan and Cecile Paris. The Role of Features and Context on Suicide Ideation Detection
15:30 Afternoon tea
Session 4: Relation & Information extraction (Chair: Andrew MacKinlay)
16:00 Presentation: Dat Quoc Nguyen and Mark Johnson. Modeling topics and knowledge bases with embeddings
16:15 Paper: Zhuang Li, Lizhen Qu, Qiongkai Xu and Mark Johnson. Unsupervised Pre-training With Seq2Seq Reconstruction Loss for Deep Relation Extraction Models
16:30 Presentation: Hanieh Poostchi, Ehsan Zare Borzeshi and Massimo Piccardi. PersoNER: Persian Named-Entity Recognition
16:45 Paper: Nagesh C. Panyam, Karin Verspoor, Trevor Cohn and Rao Kotagiri. ASM Kernel: Graph Kernel using Approximate Subgraph Matching for Relation Extraction
17:00 Paper: Gitansh Khirbat, Jianzhong Qi and Rui Zhang. N-ary Biographical Relation Extraction using Shortest Path Dependencies
17:15 End of session
18:00 Dinner

Wednesday 7th - Monash Caulfield B2.14

Session 5: Invited talk & Shared task (Chair: Diego Molla)
9:00 Invited talk: Steven Bird. Getting started with an Australian language
9:45 Shared Task
Andrew Chisholm, Ben Hachey and Diego Mollá. Overview of the 2016 ALTA Shared Task: Cross-KB Coreference
Gitansh Khirbat, Jianzhong Qi and Rui Zhang. Disambiguating Entities Referred by Web Endpoints using Tree Ensembles
S. Shivashankar, Yitong Li and Afshin Rahimi. Filter and Match Approach to Pair-wise Web URI Linking
Cheng Yu, Bing Chu, Rohit Ram, James Aichinger, Lizhen Qu and Hanna Suominen. Pairwise FastText Classifier for Entity Disambiguation
10:15 Morning tea
Session 6: Short-papers & posters (Chair: Karin Verspoor)
10:45 Short-paper lightning talks
Aditya Joshi, Vaibhav Tripathi, Pushpak Bhattacharyya, Mark Carman, Meghna Singh, Jaya Saraswati and Rajita Shukla. How Challenging is Sarcasm versus Irony Classification?: A Study With a Dataset from English Literature
Ming Liu, Gholamreza Haffari and Wray Buntine. Learning cascaded latent variable models for biomedical text classification
Bo Han, Antonio Jimeno Yepes, Andrew MacKinlay and Lianhua Chi. Temporal Modelling of Geospatial Words in Twitter
Antonio Jimeno Yepes and Andrew MacKinlay. NER for Medical Entities in Twitter using Sequence to Sequence Neural Networks
Dat Quoc Nguyen, Mark Dras and Mark Johnson. An empirical study for Vietnamese dependency parsing
Will Radford, Ben Hachey, Bo Han and Andy Chisholm. :telephone::person::sailboat::whale::okhand: ; or “Call me Ishmael” – How do you translate emoji?
Xavier Holt, Will Radford and Ben Hachey. Presenting a New Dataset for the Timeline Generation Problem
11:10 Poster Session
12:00 Lunch
13:15 Business Meeting
Session 7: Applications (Chair: Trevor Cohn)
13:35 Paper: Hafsah Aamer, Bahadorreza Ofoghi and Karin Verspoor. Syndromic Surveillance through Measuring Lexical Shift in Emergency Department Chief Complaint Texts
13:50 Paper: Rui Wang, Wei Liu and Chris McDonald. Featureless Domain-Specific Term Extraction with Minimal Labelled Data
14:05 Presentation: Ehsan Shareghi. Unbounded and Scalable Smoothing for Language Modeling
14:30 Paper: Shunichi Ishihara. An Effect of Background Population Sample Size on the Performance of a Likelihood Ratio-based Forensic Text Comparison System: A Monte Carlo Simulation with Gaussian Mixture Model
14:45 Presentation: Oliver Adams, Shourya Roy and Raghu Krishnapuram. Distributed Vector Representations for Unsupervised Automatic Short Answer Grading
15:00 Paper: Andrei Shcherbakov, Ekaterina Vylomova and Nick Thieberger. Phonotactic Modeling of Extremely Low Resource Languages
15:15 Presentation: Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird and Trevor Cohn. Cross-Lingual Word Embeddings for Low-Resource Language Modeling
15:30 Afternoon tea
Session 8: Closing
16:00 Awards for best paper and best presentation
16:10 ALTA Closing
16:25 End of session

Invited talks

Mark Steedman (University of Edinburgh)

On Distributional Semantics

The central problem in open-domain question answering from text is the problem of entailment. Given enough text, the answer is almost certain to be there, but it is likely to be expressed in a different form from the one the question suggests: either as a paraphrase, or in a sentence that entails or implies the answer.

We cannot afford to bridge this gap by open-ended theorem-proving search. Instead we need a semantics for natural language that directly supports common-sense inference, such as that arriving somewhere implies subsequently being there, and invading a country implies attacking it. We would like this semantics to be compatible with traditional logical operator semantics including quantification, negation and tense, so that not being there implies not having arrived, and not attacking implies not invading.

There have been many attempts to build such a semantics of content words by hand, from the generative semantics of the '60s to WordNet and other computational resources of the present. However, such systems have remained incomplete and language-specific in comparison to the vastness of human common-sense reasoning. One consequence has been renewed interest in the idea of treating the semantics as "hidden", to be discovered through machine learning, an idea that has its origins in the "semantic differential" of Osgood, Suci, and Tannenbaum in the '50s.

There are two distinct modern approaches to the problem of data-driven or "distributional" semantics. The first, which I will call "collocational", is the direct descendant of the semantic differential. In its most basic form, the meaning of a word is taken to be a vector in a space whose dimensions are defined by the lexicon of the language, and whose magnitude is defined by counts of those lexical items within a fixed window over the string (although in practice the dimensionality is reduced and the relation to frequency less direct). Crucially, semantic composition is defined in terms of linear algebraic operations, notably vector addition.
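The basic collocational setting described above can be sketched in a few lines; the count vectors here are invented for illustration and do not come from any real corpus:

```python
import math

# Toy 3-dimensional count vectors; the values are hypothetical,
# standing in for counts of co-occurring words within a fixed window.
vec = {
    "invade":  [2.0, 0.0, 1.0],
    "attack":  [1.5, 0.2, 0.9],
    "country": [0.1, 3.0, 0.4],
}

def compose(u, v):
    """Additive composition: the phrase vector is the sum of the word vectors."""
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Distributional similarity as the cosine of the angle between vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

invade_country = compose(vec["invade"], vec["country"])
attack_country = compose(vec["attack"], vec["country"])
similarity = cosine(invade_country, attack_country)
```

In practice the dimensionality is reduced (e.g. via matrix factorization or neural embeddings) and the relation to raw frequency is less direct, but composition by vector addition works the same way.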

A second "denotational" approach defines the meaning of a word in terms of the entities that it is predicated over and the ensembles of predications over entities of the same types, obtained by machine-reading with wide coverage parsers. (Names or designators in text are generally used as a proxy for the entities themselves.) Semantic composition can then be defined as an applicative system using logical operators such as quantifiers and negation, as in traditional formal semantics.
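A minimal sketch of this applicative view, with an invented entity domain and a hand-written predicate extension standing in for what machine reading would extract:

```python
# Hypothetical entity domain and predicate extension; in the denotational
# approach these would be harvested from parsed text, with names in text
# serving as proxies for the entities themselves.
ENTITIES = {"rome", "carthage", "gaul"}
invaded = {"carthage", "gaul"}  # entities the predicate "invade" holds of

def invade(x):
    """One-place predicate over entities."""
    return x in invaded

def neg(p):
    """Logical negation as an operator on predicates."""
    return lambda x: not p(x)

def exists(p):
    """Existential quantification over the entity domain."""
    return any(p(x) for x in ENTITIES)

something_invaded = exists(invade)      # some entity was invaded
something_spared = exists(neg(invade))  # some entity was not invaded
```

The point of the sketch is that composition is function application, so logical operators like negation and quantification combine with learned predicates exactly as in traditional formal semantics.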

The talk reviews recent work in both collocation-based and denotation-based distributional semantics, and asks for each what dimensions of meaning are actually being represented. It argues that the two approaches are largely orthogonal on these dimensions. Collocational representations are good for representing ambiguity, with linear algebraic composition most effective at disambiguation and representing distributional similarity. Denotational representations represent something more like a traditional compositional semantics, but one in which the primitive relations correspond to those of a hidden language of logical form representing paraphrase and common-sense entailment directly.

To make this point, the talk discusses recent work in which collocational distributional representations such as embeddings have been used as proxies for semantic features in models such as LSTM, to guide disambiguation during parsing, while a lexicalized denotation-based distributional semantics is used to support inference of entailment. I will show that this hybrid approach can be applied with a number of parsing models, including transition-based and supertagging, to support entailment-based QA with denotation-based distributional representations. I will discuss work at Edinburgh and elsewhere in which the semantics of paraphrases is represented by a single cluster identifier, and where common-sense inference (derived from a learned entailment graph) is built into the lexicon and projected by syntactic derivation, rather than delegated to a later stage of inference. The method can be applied cross-linguistically, in support of machine translation. Ongoing work extends the method to extract multi-word items, light-verb constructions, and an aspect-based semantics for temporal/causal entailment, and to the creation and interrogation of Knowledge Graphs and Semantic Nets via natural language.

Hercules Dalianis (Stockholm University)

HEALTH BANK: A Workbench for Data Science Applications in Healthcare

Healthcare faces many challenges in monitoring and predicting adverse events, such as healthcare-associated infections and adverse drug events, which can occur while a patient is being treated in hospital for her disease. The research questions are: when and how many adverse events have occurred, and how can they be predicted? Nowadays this information is contained in electronic patient records, written both in structured form and in unstructured free text. This talk will describe the data used for our research in HEALTH BANK, the Swedish Health Record Research Bank, which contains over 2 million patient records from 2007-2014. Topics include the detection of symptoms, diseases, body parts and drugs in Swedish electronic patient record text, including deciding on the certainty of a symptom or disease and detecting adverse (drug) events. Future research includes detecting early symptoms of cancer and de-identifying electronic patient records for secondary use.

Steven Bird (University of Melbourne, University of California Berkeley)

Getting started with an Australian language

At least a dozen Australian indigenous languages are still being learnt by children as their first language. These children have limited access to western-style education and often gain only limited proficiency in English. The languages are effectively unwritten, as there are no naturally occurring contexts where people would need to write the language. The same situation is repeated around the world, where remote communities do not write their language and do not acquire the national language, and government and NGO employees who work with these communities must learn to speak an unwritten language without the help of written resources. In this presentation I will report on early experiences working with Kunwinjku, a polysynthetic language spoken by 1,200 people in western Arnhem Land, leading to several open research questions in the area of tools for adult learners of unwritten languages.