TY - JOUR
T1 - Excavating Grey Literature
T2 - a case study on the rich indexing of archaeological documents via Natural Language Processing techniques and Knowledge Based resources
AU - Vlachidis, Andreas
AU - Tudhope, Douglas
AU - Binding, Ceri
AU - May, Keith
PY - 2010/12/31
Y1 - 2010/12/31
N2 - The paper discusses the use of Information Extraction (IE), a Natural Language Processing (NLP) technique, to assist rich semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic-aware rich indexing of diverse natural language resources with properties capable of satisfying information retrieval from on-line publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project. The paper proposes use of the English Heritage extension (CRM-EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology Oriented Information Extraction process. The process of semantic indexing is based on a rule-based Information Extraction technique, which is facilitated by the General Architecture for Text Engineering (GATE) toolkit and expressed by Java Annotation Pattern Engine (JAPE) rules. Initial results suggest that the combination of Information Extraction with Knowledge resources and standard Conceptual Models is capable of supporting semantic-aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms. The case study involves semantic indexing of 535 unpublished online documents, often referred to as “Grey Literature”, from the Archaeology Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and E19.Physical Object.
AB - The paper discusses the use of Information Extraction (IE), a Natural Language Processing (NLP) technique, to assist rich semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic-aware rich indexing of diverse natural language resources with properties capable of satisfying information retrieval from on-line publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project. The paper proposes use of the English Heritage extension (CRM-EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology Oriented Information Extraction process. The process of semantic indexing is based on a rule-based Information Extraction technique, which is facilitated by the General Architecture for Text Engineering (GATE) toolkit and expressed by Java Annotation Pattern Engine (JAPE) rules. Initial results suggest that the combination of Information Extraction with Knowledge resources and standard Conceptual Models is capable of supporting semantic-aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms. The case study involves semantic indexing of 535 unpublished online documents, often referred to as “Grey Literature”, from the Archaeology Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and E19.Physical Object.
KW - natural language processing
KW - ontology oriented information extraction
KW - semantic annotations
KW - cidoc-crm
KW - crm-eh
KW - information knowledge management
U2 - 10.1108/00012531011074708
DO - 10.1108/00012531011074708
M3 - Article
VL - 62
SP - 466
EP - 475
JO - Aslib Journal of Information Management
JF - Aslib Journal of Information Management
SN - 0001-253X
IS - 4/5
ER -