Go with the flow - workflows as a recipe for reproducible results

    Activity: Talk or presentation (Oral presentation)

    Description

    The cultural heritage domain (and others) has experienced a movement towards open data and the FAIR data principles, with the aim of improving the visibility and reusability of datasets. These principles could and should extend to documenting paradata on the methodology and the versioned tools and dependencies used in creating a dataset, allowing it to be reproduced (or adjusted, using different input parameters) and improving its provenance and credibility, giving greater confidence that it can be reused.
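    As a minimal illustration of capturing such paradata (the dependency names and record fields below are invented for this example, not an established schema), versioned tools and dependencies can be recorded programmatically, for instance in Python:

import json
import platform
import sys
from datetime import datetime, timezone
from importlib.metadata import PackageNotFoundError, version

# Hypothetical dependency list for a dataset-producing pipeline.
DEPENDENCIES = ["spacy", "pandas"]

def collect_paradata(methodology_note: str) -> dict:
    """Record versioned tools and dependencies alongside a methodology note."""
    versions = {}
    for name in DEPENDENCIES:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "dependencies": versions,
        "methodology": methodology_note,
    }

print(json.dumps(collect_paradata("Vocabulary-driven NER over text extracts."), indent=2))

    Publishing a record of this kind alongside a dataset pins down the environment needed to re-run the processing.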
    Recent years have seen increasing calls to address the “replication crisis” in research, emphasising the need for detailed documentation of methodologies and datasets. A methodology overview as described in a published paper may not be presented at a sufficient level of granularity - e.g. the precise versions of the tools and dependencies used. Instead, or additionally, workflows and associated datasets can be published and formally referenced - thereby providing concise, clear and objective instructions (a recipe) for sourcing data and reproducing the results described in a publication. This approach also potentially facilitates incremental improvement of results by forking and revising particular steps.
    We describe initial versions of workflows created as part of the ATRIUM project for the task of performing vocabulary-driven Named Entity Recognition (NER) on archaeological texts. The main purpose of this processing is to enable semantic enrichment of existing subject metadata, building on and extending work previously undertaken in the ARIADNE project on KOS-based enrichment of archaeological fieldwork reports [1]. Identifying instances of vocabulary concepts within the text as candidates for subject indexing can contribute towards greater semantic integration of the reports being processed [2].

    The main steps are to:
    • Extract pertinent text from a supplied set of published reports or grey literature describing archaeological interventions.
    • Apply a vocabulary-driven NER process using a pipeline architecture to annotate the text extracts - identifying named entities (year spans, named periods, object and monument types, place names, activities etc.) within the input text, based on a set of pre-defined controlled vocabularies and a combination of general and bespoke supplementary rules.
    • Produce output in various formats, including listings of span entities locating the character positions of the named entities within the text, reconciled to vocabulary concept identifiers wherever possible (a minimal sketch of this matching follows the list).
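    As an illustrative sketch only (the terms and concept identifiers are invented, and spaCy's PhraseMatcher stands in here for the actual pipeline and supplementary rules), vocabulary-driven matching producing span entities could look like this:

import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical vocabulary fragment: preferred term -> concept identifier.
# A real run would load the chosen controlled vocabularies (e.g. SKOS data).
VOCAB = {
    "round barrow": "http://example.org/concept/monument/round-barrow",
    "flint scraper": "http://example.org/concept/object/flint-scraper",
    "bronze age": "http://example.org/concept/period/bronze-age",
}

nlp = spacy.blank("en")  # tokenizer only; no statistical model needed here
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("CONCEPT", [nlp.make_doc(term) for term in VOCAB])

doc = nlp("Excavation of the round barrow recovered a flint scraper of Bronze Age date.")

# Emit span entities: character positions reconciled to concept identifiers.
for _, start, end in matcher(doc):
    span = doc[start:end]
    print(span.start_char, span.end_char, span.text, VOCAB[span.text.lower()])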
    The individual steps described above can be formulated as a workflow. A modular approach means workflows may be nested - each step may itself be described within another (more granular) workflow, which could be produced and/or performed by another party. Workflows are not necessarily a linear sequence of steps; the next step may branch depending on some condition being met (a simplified sketch of nesting and branching follows the list below). Our aim for the workflow is that it is:
    • Modular - each step of a workflow may itself be a standalone workflow. In the steps described above, the first step is a separate workflow undertaken by project members in another organisation.
    • Reusable - modular workflows (and the resources they reference) may be reused within other workflows. Some steps in our own workflow are shared among multiple workflows, saving effort where requirements overlap across work package tasks in a project.
    • Actionable - the aim is to go beyond traditional technical documentation by including working source code examples to illustrate functionality. We achieve this using Python notebooks published in a public GitHub repository. These notebooks contain commented, executable, step-by-step source code examples illustrating precisely how the input data is obtained and processed, and what the resultant output will be.
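    To make modularity, nesting and branching concrete, the following deliberately simplified sketch composes workflow steps in plain Python (the function and step names are invented; in practice a workflow language or engine would typically be used):

from typing import Callable

Step = Callable[[dict], dict]

def extract_text(ctx: dict) -> dict:
    """Stand-in for the text-extraction sub-workflow run by another party."""
    ctx["text"] = f"Text extracted from {ctx['report']}"
    return ctx

def annotate_entities(ctx: dict) -> dict:
    """Stand-in for the vocabulary-driven NER annotation step."""
    ctx["entities"] = []  # the real pipeline would populate this
    return ctx

def run_workflow(steps: list[Step], ctx: dict) -> dict:
    """Run steps in order, stopping early if no usable text remains."""
    for step in steps:
        ctx = step(ctx)
        if not ctx.get("text"):  # a conditional branch, not a linear sequence
            break
    return ctx

# Nesting: the first step of the outer workflow is itself a workflow,
# and that inner workflow is reusable within other workflows.
inner: list[Step] = [extract_text]
outer: list[Step] = [lambda ctx: run_workflow(inner, ctx), annotate_entities]
print(run_workflow(outer, {"report": "report_001.pdf"}))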
    The process of formulating and documenting a workflow is itself beneficial, as an aid to clarifying understanding of (and perhaps rethinking) the work being undertaken and the methodology adopted. In addition to I/O formats, data types and structures, and tools and dependencies, documentation could describe any observed scalability limitations and appraise any advantages or deficiencies of the approach taken, which may be revisited and perhaps improved in subsequent versions.
    The final workflows are intended to be published via the Social Sciences and Humanities (SSH) Open Marketplace to provide a suite of openly available, referenceable and reusable resources. Once published, mutual referencing may be performed, whereby a dataset references the workflow employed to produce it, and the workflow references the resultant dataset as an output. Any associated source code and examples will be published via public GitHub repositories.
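    The shape of such mutual referencing might be expressed in machine-readable metadata; the following sketch uses invented identifiers and PROV-style property names purely for illustration:

import json

# Purely hypothetical identifiers; published records would use persistent
# identifiers (e.g. DOIs) and an established metadata vocabulary.
workflow_id = "https://example.org/workflows/vocab-ner/v1"
dataset_id = "https://example.org/datasets/ner-annotations/v1"

dataset_record = {"@id": dataset_id, "wasGeneratedBy": workflow_id}
workflow_record = {"@id": workflow_id, "generated": dataset_id}

print(json.dumps({"dataset": dataset_record, "workflow": workflow_record}, indent=2))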

    References
    [1] Binding, C & Tudhope, D 2024, 'KOS-based enrichment of archaeological fieldwork reports', Knowledge Organization, vol. 51, no. 5, pp. 292-299. https://doi.org/10.5771/0943-7444-2024-5-292
    [2] Binding, C, Tudhope, D & Vlachidis, A 2019, 'A study of semantic integration across archaeological data and reports in different languages', Journal of Information Science, vol. 45, no. 3, pp. 364-386. https://doi.org/10.1177/0165551518789874
    Period: 7 May 2025
    Event title: Computer Applications and Quantitative Methods in Archaeology (CAA) 2025: Digital Horizons: Embracing heritage in an evolving world
    Event type: Conference
    Conference number: 52
    Location: Athens, Greece
    Degree of Recognition: International