Part-of-Speech Tagged Building Codes (PTBC)


By Xiaorui Xue and Jiansong Zhang

Purdue University

This is a natural language dataset of Part-of-Speech (POS) tagged building codes. It includes 1,522 sentences of text from Chapters 5 and 10 of the 2015 International Building Code. It adopts the Penn Treebank tag set.

Version 1.0, published on 26 Aug 2019. doi:10.4231/Y0ZQ-4946. Content may change until committed to the archive on 26 Sep 2019.

Licensed under CC0 1.0 Universal

Description

This dataset of Part-of-Speech (POS) tagged building codes contains 1,522 sentences from Chapters 5 and 10 of the 2015 International Building Code. It adopts the original version of the Penn Treebank tag set for the POS tags. It includes tagging results from five human annotators and seven machine taggers. For each word, it also provides the POS tag most commonly chosen by the machine taggers and the POS tag most commonly chosen by the human annotators. For detailed explanations of the meanings of the POS tags, please refer to Building a Large Annotated Corpus of English: The Penn Treebank [1]. For an explanation of how this dataset was developed, please refer to the paper listed as [2].

1. Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). "Building a Large Annotated Corpus of English: The Penn Treebank." Computational Linguistics, 19(2), 313-330.

2. Xue, X., and Zhang, J. (2019). "Evaluation of Seven Part-of-Speech Taggers in Tagging Building Codes: Identifying the Best Performing Tagger and Common Sources of Errors." Proc., ASCE Construction Research Congress, ASCE, Reston, VA, submitted.
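The "most commonly chosen POS tag" described above can be illustrated as a simple majority vote over the tags that several taggers assign to one word. The following is a minimal sketch, not the authors' procedure: the function name most_common_tag, the list-of-strings input layout, the example tags, and the rule that ties go to the tag that appears first are all assumptions made for illustration.

```python
from collections import Counter


def most_common_tag(tags):
    """Return the tag that occurs most often in `tags`.

    Assumption: ties are broken in favor of the tag that appears
    first in the input list. The dataset's own tie rule is not
    stated in this description.
    """
    if not tags:
        raise ValueError("tags must contain at least one tag")
    counts = Counter(tags)
    best = max(counts.values())
    # Walk the input in order so the earliest top-scoring tag wins.
    for tag in tags:
        if counts[tag] == best:
            return tag


# Hypothetical tags assigned to a single word by five annotators.
human_tags = ["NN", "NN", "JJ", "NN", "VBG"]
print(most_common_tag(human_tags))
```

The same function could be applied separately to the human annotator tags and to the machine tagger tags for each word, which would yield the two consensus tags the description mentions.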

Cite this work

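Researchers should cite this work as follows:

Xue, X., and Zhang, J. (2019). Part-of-Speech Tagged Building Codes (PTBC). Purdue University Research Repository. doi:10.4231/Y0ZQ-4946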

