In addition to the complexities of training and deploying the Autograder models discussed in the last blog post, we also faced several challenges related to the architecture and design of the web client that displayed the Autograder results. In this post, we'll be diving deep into how we served up the Autograder models that power our quality management product.
Since we developed this feature with extensibility in mind, we heavily utilized the core principles of both abstraction and composition. The high-level diagram below summarizes the architectural design that we used to ensure that future iterations and extensions of this feature could be easily added.
We aimed to make the frontend generic enough to handle any Autograder. From a high level, you can see that the only place where we have logic for specific Autograders is in the LambdaFunction class (outlined in orange in the diagram above). As such, whenever we want to add or remove a particular Autograder type, we only need to add support for it within the LambdaFunction class.
Let’s walk through the user flow with an example. Hermione Granger is the manager of Harry Potter, and she wants to grade an interaction that Harry had with a customer. Hermione creates a QA scorecard based on an existing template; templates are typically set up for teams by an org admin. The first question has an Autograder attached.
The question with an Autograder is configured on the scorecard template, which we call the QAForm model. If an admin changes a question or adds a new one, those changes create a new QAFormVersion model, which stores the list of questions for that form version. This allows previous versions of the template to still render correctly. Individual scorecards are tracked using the QAScorecard model, which points to a specific version of the form.
So when Hermione creates a scorecard containing a question with the Autograder, the backend uses the QA form version ID to find the correct version of the form and the corresponding list of questions.
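As a rough sketch of how these models might relate, here is a minimal version in Python dataclasses. The class names (QAForm, QAFormVersion, QAScorecard) come from the post; the specific fields are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Question:
    question_id: str
    text: str
    autograder_type: Optional[str] = None  # e.g. "phrase_tracker"; None if no Autograder
    autograder_args: dict = field(default_factory=dict)

@dataclass
class QAFormVersion:
    version_id: str
    questions: List[Question]  # frozen once created

@dataclass
class QAForm:
    form_id: str
    # Editing or adding a question creates a new QAFormVersion rather than
    # mutating an old one, so prior scorecards still render correctly.
    versions: List[QAFormVersion]

@dataclass
class QAScorecard:
    scorecard_id: str
    # Each scorecard pins a specific form version, not the form itself.
    form_version_id: str
```

The key design point is that a scorecard references a form *version*, so later edits to the template can never change how an already-created scorecard renders.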
Now, we can walk through the process of actually fetching the Autograder results and displaying them to the user. First, we extract the questions from the selected scorecard form version.
Each question object stores some relevant metadata, but for our purposes the most important pieces are: a unique question ID, an Autograder type, and Autograder arguments. We then pass this list of questions (for our selected template) to our AutograderResults API endpoint.
We intentionally created a generic endpoint so that the web client would not need to know the details of each question and Autograder in order to get results. The GET endpoint passes the list of questions to the QAAutograder class, which internally handles each question and prepares it to be passed to the AI model.
The QAAutograder helper class takes the list of questions and groups them by Autograder type, filtering out questions without Autograders. The grouped questions are then passed to the LambdaFunction class, which is where we finally start adding specialized logic for each Autograder.
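The grouping step might look like the following sketch (the dict shape and field names are assumptions for illustration):

```python
from collections import defaultdict

def group_by_autograder(questions):
    """Group question dicts by their Autograder type, dropping
    questions that have no Autograder attached."""
    grouped = defaultdict(list)
    for question in questions:
        autograder_type = question.get("autograder_type")
        if not autograder_type:
            continue  # no Autograder attached; nothing to grade
        grouped[autograder_type].append(question)
    return dict(grouped)
```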
The first thing we do here is construct the payload containing the Autograder arguments that we will send when we invoke the endpoint. Next, we invoke the SageMaker endpoint using the payload constructed for that Autograder type. The payload varies because the inputs for the Autograders vary. For example, the phrase tracker Autograder expects an additional ‘phrases’ input that the sentiment and spelling & grammar Autograders do not.
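A minimal sketch of this step, assuming a JSON payload and the standard boto3 SageMaker runtime client (the payload field names and endpoint name are hypothetical):

```python
import json

def build_payload(autograder_type, transcript, args):
    """Build a per-Autograder payload; field names are illustrative."""
    payload = {"transcript": transcript}
    if autograder_type == "phrase_tracker":
        # The phrase tracker needs the phrases to look for;
        # sentiment and spelling & grammar Autograders do not.
        payload["phrases"] = args["phrases"]
    return payload

def invoke_autograder(sagemaker_runtime, endpoint_name, autograder_type, transcript, args):
    """Invoke a SageMaker endpoint with a type-specific payload.
    `sagemaker_runtime` would be boto3.client("sagemaker-runtime")."""
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_payload(autograder_type, transcript, args)),
    )
    return json.loads(response["Body"].read())
```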
Finally, we take the results that we receive back from the SageMaker endpoint (see the previous blog post for how the Autograder models work) and format them according to how we want to display the results for each individual Autograder.
All of these results are returned from our API endpoint, indexed by question ID. The frontend then parses and displays these results for each corresponding question, providing further context and assistance to the scorecard grader.
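The indexing step could be sketched like this; the raw result shape and field names are assumptions for illustration:

```python
def format_results(raw_results):
    """Index raw Autograder results by question ID so the frontend
    can attach each result to its corresponding question."""
    by_question = {}
    for result in raw_results:
        by_question[result["question_id"]] = {
            "summary": result.get("summary", ""),
            # Index of the transcript message the result came from,
            # used later to highlight and scroll to it in the UI.
            "message_index": result.get("message_index"),
        }
    return by_question
```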
We wanted the application to be able to support a wide variety of questions and Autograders, even though we currently only have a small number of Autograders. Because we concentrated the Autograder-specific logic in one part of the architecture (the LambdaFunction class), we can easily extend the set of supported Autograder types in the future without needing to change multiple parts of the technical stack.
Speed vs. complexity
One of the first challenges we faced in development came down to a common ML trade-off between speed and complexity. The more complex the Autograder, the more accurate and helpful the results. On the other hand, a simpler Autograder model lends itself to faster results, and consequently less disruption for the human grader. More complex Autograder results would also be harder to organize and display visually to the grader.
Ultimately, we found a good balance such that the Autograders were complex enough to output useful results without requiring the user to wait longer than twenty seconds for the results. We also added an elegant loading state to indicate to the user that an Autograder was loading without blocking user interaction with other parts of the QA scorecard. Furthermore, we have feedback avenues in place that allow users to report if the Autograders are too slow, or too inaccurate; our engineering team is closely monitoring these and is ready to tune as necessary.
Additionally, since the actual logic for our Autograders lives in SageMaker as opposed to within our own backend, we were able to modify our approach for calling the SageMaker endpoint to further improve the efficiency of fetching results. By design, we wanted our scorecards to have no limit to the number of questions (and therefore the number of Autograders) that could be added. As such, we could not just make one call to the endpoint per question – the performance of this approach would be highly dependent on the number of questions in any given scorecard.
Instead, we modified our logic to invoke the endpoint a constant number of times rather than once per question. This drastically decreased the overall time required to load results for any given scorecard, because the number of invocations no longer depended on the number of questions on the scorecard; we were always invoking the endpoint at most once per supported Autograder type.
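The effect of this batching can be illustrated with a small sketch comparing the two strategies (question shapes are assumptions):

```python
def count_invocations_naive(questions):
    """One endpoint call per question that has an Autograder:
    cost grows linearly with scorecard size."""
    return sum(1 for q in questions if q.get("autograder_type"))

def count_invocations_batched(questions):
    """At most one endpoint call per supported Autograder type,
    regardless of how many questions the scorecard has."""
    return len({q["autograder_type"] for q in questions if q.get("autograder_type")})
```

With, say, fifteen questions spread across two Autograder types, the naive approach makes fifteen calls while the batched approach makes two.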
Since this tool was meant to aid the grader and make their life easier, we also spent a substantial amount of time thinking about the most helpful and least disruptive ways to display the Autograder results. We ultimately decided on including a short summary of the Autograder results below each question that had an Autograder attached.
However, through internal testing, we discovered that while detecting phrases of interest and displaying them was valuable, it was difficult to see exactly which part of the transcript the phrase of interest came from. In the example scorecard above, the second Autograder phrase (“We received your request to delete your example account…”) does not appear above the fold on the Zendesk conversation on the right, and would still require users to read through the entire interaction to find where exactly it appears in the transcript. We realized that this problem was quite important to address, particularly when we want to “keep the human in the loop” and use the Autograder results as an aid as opposed to a replacement for an actual grader.
Our solution was to make each result clickable, so that selecting one highlights and scrolls to the message containing that result (in this example, the detected phrase) within the transcript. This improves the grader's experience, since everything they need is now in view at once. From a technical standpoint, we simply made sure that each Autograder result included the index of the associated message so we could build a UI that highlighted the appropriate message on screen.
In addition to optimizing the grader’s workflow, we also wanted the agents themselves to have a smooth experience reviewing the scorecard feedback. As an agent, I might have dozens of graded scorecards that I want to quickly read through to understand where I can improve. In this case, waiting even twenty seconds for an Autograder to load would be a frustrating experience for agents.
To make the agent experience faster, we decided to save the Autograder results on the completed scorecard model. A scorecard is only ever created when the customer interaction is complete, so we knew the agent transcript would not change. Therefore, we could store the results of the Autograder at scorecard creation time. In addition to improving the workflow for agents navigating through multiple graded scorecards, this also significantly reduced the number of redundant, unnecessary, and expensive calls we were making to the SageMaker endpoint.
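A minimal sketch of this compute-once-at-creation pattern, with an in-memory store standing in for the real scorecard model (all names here are hypothetical):

```python
class ScorecardResultStore:
    """Compute Autograder results once, at scorecard creation time,
    and persist them with the scorecard. This is safe because the
    transcript is complete and immutable once a scorecard exists."""

    def __init__(self, fetch_results):
        self._fetch_results = fetch_results  # expensive SageMaker-backed call
        self._saved = {}

    def create_scorecard(self, scorecard_id, transcript):
        # Called exactly once per scorecard, when the interaction is done.
        self._saved[scorecard_id] = self._fetch_results(transcript)

    def get_results(self, scorecard_id):
        # Agents reading graded scorecards never wait on the model.
        return self._saved[scorecard_id]
```

Because reads hit the saved copy, an agent paging through dozens of graded scorecards never triggers a SageMaker call.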
Looking forward, we anticipate growing this feature to support additional Autograders as well as to support multiple Autograders per question. As discussed above, since we relied heavily on the principles of both abstraction and composition, no major rewrites of infrastructure are needed for expanded functionality. For example, further iterations such as supporting more variations of Autograders should lend themselves to very straightforward implementations that do not require much in-depth knowledge about the specific SageMaker API invocation requirements or how the AI models function within SageMaker.
QA Autograders is an elegant frontend web experience backed by powerful machine learning capabilities. We're excited to continue applying our learnings to future products with the ultimate goal of helping everyone achieve their professional best!