What are the potential risks across the entire risk landscape (impact and likelihood)? Consider social, technological, operational, political, economic, and sustainability factors, which may include privacy, human and civil rights, bias, cybersecurity, supply chain risks, and concerns about traceability and transparency. Consider indirect and second- or third-order effects, as well as possible emergent behaviors of the system.
What is the risk response (including avoidance, mitigation, transference, and acceptance) for each risk?
What are the evaluated risk and the residual risk for each risk after mitigation? Are the risks evaluated qualitatively or quantitatively? Refer to DAGR and other risk management frameworks for recommendations.
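Where a quantitative evaluation is used, inherent and residual exposure for each risk can be recorded with a simple likelihood-times-impact score. The sketch below is illustrative only; the `Risk` class, the rating scales, and the `mitigation_factor` field are assumptions for this example, not values prescribed by DAGR:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: float          # 0.0-1.0 estimated probability of occurrence
    impact: float              # 1-5 ordinal impact rating
    mitigation_factor: float = 0.0  # fraction of exposure removed by mitigation (0-1)

    @property
    def inherent_score(self) -> float:
        """Risk exposure before any mitigation is applied."""
        return self.likelihood * self.impact

    @property
    def residual_score(self) -> float:
        """Risk exposure remaining after mitigation."""
        return self.inherent_score * (1.0 - self.mitigation_factor)

# Example: a data-poisoning risk partially mitigated by provenance checks.
poisoning = Risk("training data poisoning", likelihood=0.3, impact=4, mitigation_factor=0.6)
print(poisoning.inherent_score, poisoning.residual_score)  # 1.2 0.48
```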
How is each risk monitored and measured throughout the AI capabilities lifecycle?
Prioritize risks. Refer to DAGR and other risk management frameworks for recommendations. This outcome may be used for incident response prioritization. Consider prioritizing based on data, infrastructure, security, accountability, resources, and model operation (DISARM Hierarchy of Risk in DAGR).
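For illustration only, a prioritization pass might combine a category precedence inspired by the DISARM ordering with residual risk scores. The `CATEGORY_PRECEDENCE` table and `prioritize` helper below are hypothetical and should be replaced with the ordering and weights your chosen framework actually prescribes:

```python
# Hypothetical precedence loosely following the DISARM ordering
# (data, infrastructure, security, accountability, resources, model operation).
CATEGORY_PRECEDENCE = {
    "data": 0,
    "infrastructure": 1,
    "security": 2,
    "accountability": 3,
    "resources": 4,
    "model operation": 5,
}

def prioritize(risks):
    """Order risks by category precedence, then by descending residual score."""
    return sorted(risks, key=lambda r: (CATEGORY_PRECEDENCE[r["category"]], -r["residual_score"]))

risks = [
    {"name": "model drift", "category": "model operation", "residual_score": 0.9},
    {"name": "label poisoning", "category": "data", "residual_score": 0.48},
    {"name": "unpatched inference host", "category": "security", "residual_score": 0.7},
]
for r in prioritize(risks):
    print(r["name"])
```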
Document any risk relationships and dependencies between AI capabilities, and document changes to the residual risk that result from those relationships and dependencies. Refer to DAGR and other risk management frameworks for recommendations.
Identify those responsible for each risk, and describe the coordination cadence in relation to the project's risk relationships and risk prioritization.
What possible risks are posed by using the system for purposes other than those for which it was originally developed or procured, by combining the system with other components (including future, yet-to-be-developed technologies or capabilities), or by training on datasets beyond the originally intended types? How can these risks be mitigated?
How will different error types and failure modes be handled?
How will error rates and failure modes be measured?
What steps have been taken to ensure that error rates, failure modes, and behavior in general are consistent across edge cases and imbalanced groups?
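One way to check this, sketched below under the assumption of a binary classifier with known subgroup membership, is to report false positive and false negative rates per subgroup and flag large gaps; the `error_rates_by_group` helper is illustrative, not a required method:

```python
import numpy as np

def error_rates_by_group(y_true, y_pred, groups):
    """Compute false positive and false negative rates per subgroup.

    Large gaps between subgroups (especially small, imbalanced ones)
    indicate inconsistent failure behavior worth investigating.
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {}
    for g in np.unique(groups):
        m = groups == g
        negatives = (y_true[m] == 0)
        positives = (y_true[m] == 1)
        fpr = np.mean(y_pred[m][negatives] == 1) if negatives.any() else float("nan")
        fnr = np.mean(y_pred[m][positives] == 0) if positives.any() else float("nan")
        report[g] = {"n": int(m.sum()), "fpr": float(fpr), "fnr": float(fnr)}
    return report

# Example: compare error rates for two subgroups of very different sizes.
print(error_rates_by_group(
    y_true=[1, 0, 1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 1, 1, 0, 0],
    groups=["a", "a", "a", "a", "a", "a", "b", "b"],
))
```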
How has sensitivity testing been conducted to ensure consistent and reliable error rates and failure modes in real-world settings?
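A minimal sensitivity sweep, assuming Gaussian input noise as a stand-in for whatever distribution shift is realistic for the deployment environment, might look like the following; the `sensitivity_sweep` helper and noise levels are illustrative assumptions:

```python
import numpy as np

def sensitivity_sweep(model_fn, X, y, noise_levels=(0.0, 0.01, 0.05, 0.1), seed=0):
    """Measure how the error rate shifts under increasing input perturbation.

    model_fn is any callable mapping an input array to predicted labels.
    A stable error rate across noise levels suggests the measured failure
    modes will hold up in real-world settings.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for sigma in noise_levels:
        X_noisy = X + rng.normal(0.0, sigma, size=X.shape)
        results[sigma] = float(np.mean(model_fn(X_noisy) != y))
    return results

# Example with a trivial threshold "model" on synthetic data.
X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)
model_fn = lambda X: (X[:, 0] > 0).astype(int)
print(sensitivity_sweep(model_fn, X, y))
```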