Part-of-Speech Tagged Building Codes Dataset



This dataset of Part-of-Speech (POS) tagged building codes contains 1,522 sentences from Chapters 5 and 10 of the 2015 International Building Code. The POS tags follow the original Penn Treebank tag set. The dataset includes tagging results from 5 human annotators and 7 machine taggers. It also provides, for each word, the POS tag chosen most often by the machine taggers and the POS tag chosen most often by the human annotators. For detailed explanations of the meanings of the POS tags, please refer to Building a Large Annotated Corpus of English: The Penn Treebank [1]. For an explanation of how this dataset was developed, please refer to the paper by Xue and Zhang [2].
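The "most commonly chosen POS tag" described above amounts to a per-word majority vote. The following Python sketch illustrates that idea only. The function names, the list-of-tag-sequences layout, the example sentence, and the tie-breaking rule (the tag seen first wins) are all assumptions made for illustration; they are not taken from the dataset files or from the authors' procedure.

```python
from collections import Counter


def majority_tag(tags):
    """Return the tag chosen most often; on a tie, the tag seen first wins.

    The tie-breaking rule is an assumption of this sketch, not a documented
    property of the dataset.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    # Counter.most_common orders equal counts by first occurrence (Python 3.7+).
    return Counter(tags).most_common(1)[0][0]


def consensus(tokens, tag_sequences):
    """Pair each token with its majority tag.

    tag_sequences holds one tag sequence per annotator or tagger,
    each aligned one-to-one with tokens (hypothetical layout).
    """
    for seq in tag_sequences:
        if len(seq) != len(tokens):
            raise ValueError("every tag sequence must align with the tokens")
    return [
        (token, majority_tag([seq[i] for seq in tag_sequences]))
        for i, token in enumerate(tokens)
    ]


# Made-up sentence and tags from three hypothetical taggers (Penn Treebank tags).
tokens = ["Exits", "shall", "be", "continuous"]
tagger_a = ["NNS", "MD", "VB", "JJ"]
tagger_b = ["NNS", "MD", "VB", "JJ"]
tagger_c = ["VBZ", "MD", "VB", "NN"]

result = consensus(tokens, [tagger_a, tagger_b, tagger_c])
for token, tag in result:
    print(f"{token}\t{tag}")
```

The same function could be applied separately to the 7 machine-tagger outputs and to the 5 human annotations to obtain the two consensus columns, provided the files are first loaded into aligned sequences.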

1. Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). "Building a Large Annotated Corpus of English: The Penn Treebank." Computational Linguistics, 19(2), 313-330.

2. Xue, X., and Zhang, J. (2019). "Evaluation of Seven Part-of-Speech Taggers in Tagging Building Codes: Identifying the Best Performing Tagger and Common Sources of Errors." Proc., ASCE Construction Research Congress, ASCE, Reston, VA, submitted.



The Purdue University Research Repository (PURR) is a university core research facility provided by the Purdue University Libraries, the Office of the Executive Vice President for Research and Partnerships, and Information Technology at Purdue (ITaP).