Stress testing of AI-based systems differs from the approach taken for traditional web services, both in the design of test cases and in the metrics used to measure quality. An AI-based system may respond differently to the same request, and this expected variability adds complexity to stress testing compared with conventional systems, where the response is deterministic and any deviation can readily be characterized as a product defect. Generating test cases for AI-based systems requires balancing breadth of test cases against depth of response quality, since most AI systems may not return a 'perfect' answer. A machine learning translation system is considered as an example, and the approach used to stress test it is presented alongside comparisons with a more traditional approach. The challenges of shipping such a system to support a growing set of features and language pairs necessitate a mature prioritization of test cases. This approach has been successful in shipping a web service that currently serves millions of users per day.