3.1 HONE: Assess Requirements, Statements of Concern, Mitigations, and Metrics
Ensure that all mitigation actions, measures, and controls have a method of being assessed and monitored throughout the life cycle.
Do all requirements (operational, functional, and technical) have appropriate modes of assessment and benchmarks of success and failure?
Do all statements of concern and mitigation actions have appropriate modes of assessment and benchmarks of success and failure?
Do all tradeoffs, trustworthiness, and confidence measures have appropriate modes of assessment and benchmarks of success and failure?
How will performance metrics be established?
How will baseline metrics for system performance be established?
How will errors be detected?
Given the use case, potential consequences, and affected stakeholders for this context, is it better to minimize certain types of error (e.g., precision vs. recall, Type I vs. Type II error)?
How will error rates be measured, both overall and across different sub-populations?
Will error rates be recorded in the Impact Assessment?
Will the metrics need to evolve as the system behavior changes during use (e.g., feedback loops)?
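The error-rate questions above can be made concrete with a small sketch: given binary labels, predictions, and a sub-population tag for each example (all data and names here are illustrative, not part of the toolkit), compute precision, recall, and Type I/Type II error rates both overall and per group.

```python
from collections import defaultdict

def error_rates(y_true, y_pred, groups):
    """Overall and per-group precision, recall, FPR (Type I), and FNR (Type II).

    y_true, y_pred: 0/1 labels; groups: sub-population tag per example.
    """
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        for key in ("ALL", g):          # tally overall and per group
            cell = stats[key]
            if p and t:
                cell["tp"] += 1
            elif p:
                cell["fp"] += 1
            elif t:
                cell["fn"] += 1
            else:
                cell["tn"] += 1
    out = {}
    for key, c in stats.items():
        out[key] = {
            "precision": c["tp"] / max(c["tp"] + c["fp"], 1),
            "recall":    c["tp"] / max(c["tp"] + c["fn"], 1),
            "fpr":       c["fp"] / max(c["fp"] + c["tn"], 1),  # Type I error rate
            "fnr":       c["fn"] / max(c["fn"] + c["tp"], 1),  # Type II error rate
        }
    return out
```

Comparing the per-group rates against the "ALL" row is one direct way to surface disparate error rates across sub-populations for the Impact Assessment.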
How will user understanding be measured?
Are there any measurement gaps or limits to the precision of measurement?
Are there latent constructs or other factors that will be difficult to operationalize or measure?
How will these issues affect risk calculations/impact analyses – do these need to be revisited?
Is the system’s ontology appropriate for the use case and for tracking alignment with the DoD AI Ethical Principles?
Are all of your Statements of Concern and all aspects of your legal/ethical/policy frameworks sufficiently addressed?
If not, re-conduct activities under the Intake and Ideation phases.
What are the anticipated failures? How will these be detected?
Is there a process for system rollback and/or stoppage?
Has a data ethics review been conducted?
3.2 HONE: Exploratory Data Analysis
How was the data collected?
Could certain classes or populations have been undersampled?
Is the data representative of the use case/deployment context?
Has the data become stale? How often will it need to be updated?
Given the above, does the data need to be re-collected?
Will re-collection of the data place additional burdens upon the sampled population?
What other steps can be taken to improve the quality of the data?
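One way to probe the undersampling question above is to compare each class's share of the sample against its expected share in the deployment context. A minimal sketch, assuming the expected shares are known (the reference proportions and threshold are illustrative inputs, not prescribed values):

```python
from collections import Counter

def undersampled_classes(labels, reference_shares, ratio_threshold=0.5):
    """Flag classes whose observed share falls well below the share
    expected in the deployment context.

    reference_shares: dict mapping class -> expected proportion.
    Returns classes where observed/expected share < ratio_threshold.
    """
    n = len(labels)
    counts = Counter(labels)
    flagged = []
    for cls, expected in reference_shares.items():
        observed = counts.get(cls, 0) / n
        if expected > 0 and observed / expected < ratio_threshold:
            flagged.append(cls)
    return flagged
```

Flagged classes are candidates for re-collection or augmentation before development proceeds.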
How was the data labeled?
Is ground truth accessible given the data type?
Could human biases affect how the data was labeled?
Could societal context affect how the data was labeled?
Given the above, does the data need to be re-labeled, or is it insufficient to proceed to development?
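Where ground truth is inaccessible, disagreement between labelers is one measurable proxy for labeling ambiguity or labeler bias. A minimal sketch using Cohen's kappa, the chance-corrected agreement between two annotators (the annotator data is illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels.

    1.0 = perfect agreement; 0.0 = agreement no better than chance.
    Low kappa suggests ambiguous guidelines or labeler bias worth auditing.
    """
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)             # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

Systematically low agreement on particular classes or sub-populations is a signal that the data may need to be re-labeled before development.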
Data Provenance, Protection, and Access
How is the data accessed? Who has access and how is it controlled?
Where is the data stored?
How is the data protected?
What ensures data provenance? How are transformations and cleaning recorded?
Is any of the data generated synthetically, or should it be?
How is the data used?
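One lightweight way to record transformations and cleaning while keeping the provenance trail tamper-evident is a hash-chained log, sketched below. The record fields and step names are illustrative, not a prescribed schema:

```python
import hashlib
import json

def record_step(log, step_name, params, data_bytes):
    """Append a transformation record whose hash chains to the previous
    entry, so later tampering with the history is detectable."""
    prev = log[-1]["hash"] if log else "0" * 64
    entry = {
        "step": step_name,
        "params": params,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "prev": prev,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return log

def verify(log):
    """Recompute every hash and link; False if any record was altered."""
    prev = "0" * 64
    for e in log:
        body = {k: v for k, v in e.items() if k != "hash"}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if e["prev"] != prev or e["hash"] != recomputed:
            return False
        prev = e["hash"]
    return True
```

Storing the hash of the data after each step also answers "how are transformations and cleaning recorded" with something auditable rather than free-text notes.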
What abnormalities, outliers, or irregularities are present in the data?
Were these irregularities a result of human error, sensor error, processing error, or natural or adversarial perturbation? What mitigations are required for greater accuracy?
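A simple first pass at the outlier questions above is Tukey's IQR fences: flag values outside [Q1 - k*IQR, Q3 + k*IQR]. The sketch below is a screening heuristic for exploratory data analysis, not a full anomaly-detection method:

```python
def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].

    Flagged values are candidates for review: human/sensor/processing
    error vs. natural or adversarial perturbation.
    """
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between order statistics.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo_fence or v > hi_fence]
```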
Will data or feedback used to update or fine-tune the model at later stages (such as through Reinforcement Learning from Human Feedback [RLHF]) have any of the issues contained in 3.2.1 or 3.2.2? How will these be mitigated?
Utilize bias identification and mitigation techniques
Have you determined which operationalization of 'fairness' is appropriate for your purposes?
Have the underlying dataset and the model been checked for unintended bias, and (if applicable) have mitigations been applied (including dataset, in-processing, or post-processing bias mitigations)?
Has the team considered how the underlying datasets may reflect the biases of the institution or individuals that collected them (including prejudice bias), the sampling or measurement methods used (measurement or sample/exclusion bias), or the biases of the individuals represented in the dataset?
Are stakeholders being consulted in terms of their domain knowledge regarding sources of unintended bias?
Have any biases and possible biases (including cognitive biases, such as automation bias) of the designers and operational users been addressed through training or system design, or leveraged in ways that contribute to system success?
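One common operationalization of 'fairness' is the disparate impact ratio: the favorable-outcome rate for the unprivileged group divided by that of the privileged group, where the "four-fifths rule" treats values below 0.8 as a red flag. A minimal sketch (the group labels and data are illustrative):

```python
def disparate_impact_ratio(y_pred, groups, privileged):
    """Favorable-outcome rate of the unprivileged group divided by
    that of the privileged group (1.0 = parity; < 0.8 is often
    treated as a red flag under the four-fifths rule)."""
    priv = [p for p, g in zip(y_pred, groups) if g == privileged]
    unpriv = [p for p, g in zip(y_pred, groups) if g != privileged]
    rate_priv = sum(priv) / len(priv)
    rate_unpriv = sum(unpriv) / len(unpriv)
    return rate_unpriv / rate_priv if rate_priv else float("inf")
```

Which fairness metric is appropriate depends on the use case; demographic parity, equalized odds, and calibration can conflict, so the choice should trace back to the statements of concern for this system.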
According to your legal/ethical/policy frameworks, statements of concern (SOCs), mitigations, use case, and mission domain – is AI suitable for this use case? Be sure to answer the following:
Is AI suitable for the task at hand?
Is the model type appropriate for the task at hand?
Do the advantages outweigh the disadvantages, known or possible?
Does utilizing AI in this case achieve something that a non-AI tool could not accomplish?
What is the specific task that the system performs?
What is the system input and output required to perform that task?