A Lender’s Guide to the Top 3 AVM Testing Methods

Understanding tests, benchmarks, and best practices for validating the accuracy of professional-grade automated valuation models (AVMs)

What you’ll find in this post:

• Introduction
• The benchmarks
◦ Sale price
◦ Contract price in purchase appraisal
◦ Refinance appraisal valuation
• Concerns with testing against sale price benchmarks
• Concerns with testing against purchase appraisal benchmarks
• Conclusion
• Addendum — accuracy measurements defined

Introduction

Over the past few years, rapid progress made in the fields of machine learning and data science has spurred a renaissance of automated valuation models (AVMs), and in particular, the development of the “lending-grade” or “professional-grade” AVM. The growing movement toward using these professional-grade AVMs in higher-stakes situations has created a new responsibility for AVM providers: to approach AVM testing with the same rigor and accuracy they’ve built into their model.

AVMs are becoming less of a “gut check” tool and more of a pillar of the home equity lending process and of secondary-market and securitization work. AVM testing methods vary in their approaches and benchmarking data, and lenders should be aware that, because testing is not standardized, there are loopholes AVM providers can use to “game” these tests. Relying on flawed test results can give lenders false comfort about how accurately an AVM will perform in normal, day-to-day use.

In this blog post, I give an overview of the different types of benchmarks and the nuances of each one, with the goal of equipping lenders with information they can use to verify the accuracy of AVM tests.

The benchmarks

To test an AVM, you need to compare the AVM’s value estimate against a real property value (benchmark) to see how accurate the model is in aggregate. This can be done nationally, or by region (state, county, ZIP code, etc.).
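As a rough illustration of measuring accuracy in aggregate, the sketch below groups AVM/benchmark pairs by region and reports the share of estimates that land within 10 percent of the benchmark (the P10 measure defined in the addendum). The function name, the tuple layout, and the sample figures are all made up for this example; they are not taken from any particular testing service.

```python
from collections import defaultdict

def within_10_percent_by_region(records):
    """Share of AVM values within 10% of the benchmark, per region.

    records: iterable of (region, avm_value, benchmark_value) tuples.
    This layout is an assumption made for the example.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for region, avm_value, benchmark_value in records:
        totals[region] += 1
        pct_error = abs(avm_value - benchmark_value) / benchmark_value
        if pct_error <= 0.10:
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}

# Made-up figures, for illustration only.
sample = [
    ("CA", 510_000, 500_000),  # 2% off
    ("CA", 360_000, 300_000),  # 20% off
    ("TX", 205_000, 200_000),  # 2.5% off
    ("TX", 240_000, 250_000),  # 4% off
]

by_region = within_10_percent_by_region(sample)
# Relabel every record with one region to get the national figure.
national = within_10_percent_by_region(("US", a, b) for _, a, b in sample)
```

With this sample, the national figure (three of four within 10 percent) hides the fact that one region performs noticeably worse than the other, which is one reason to look at regional breakdowns as well.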

The benchmark you choose affects the performance stats, and it determines how easy it is for a provider to tailor AVM results to the benchmark and report an inflated accuracy measurement. Currently, most AVMs are compared against three benchmarks. Below are my thoughts on the pros and cons of each.

Sale price: The market price a property sold for, documented by the county recorder, or multiple listing service (MLS) sources.

Pros:

  • Reflects the real market value of a property.
  • There are plenty of benchmarks available (large sample size).

Cons:

  • Most AVMs have access to these prices, which allows for gaming of the test. For example, if the AVM provider can look up the test’s addresses and see the last sale value, it is easy for them to tailor their AVM results to better fit the benchmark.
  • AVMs use this data to train their model. In statistics, including your benchmark data in the training data leads to biased (more favorable) results.

Contract price in purchase appraisal: The contract price agreed to by a buyer and seller, denoted on an appraisal report for a property currently pending sale (typically).

Pros:

  • More difficult for an AVM to game, since most AVM providers do not have access to these benchmarks, and they are not yet reflected in the MLS (usually).
  • Allows for benchmarking/testing in non-disclosure states where sale prices are difficult to find.

Cons:

  • Most AVMs have access to the listing and pending price on these properties since they are typically listed for sale on the MLS. AVMs can easily use the listing price to predict what the property is under contract for (the two are highly correlated, though not perfectly).
  • There are not as many purchase appraisal benchmarks as sale prices to form a large testing sample.

Refinance appraisal valuation: The appraiser’s expert opinion of value on a property that is undergoing the refinance process on a loan.

Pros:

  • Not easily gamed. These properties are not typically for sale, so the AVM does not know what they are listed for, what they will sell for, or any MLS-based characteristic information on them.
  • Most AVM providers do not have access to these appraised values, so their model is forced to be blind to the benchmark.
  • The appraiser’s research and valuation is the most informed and accurate valuation methodology for this scenario.
  • Allows for benchmarking/testing in non-disclosure states where sale prices are difficult to find.

Cons:

  • Not as many benchmarks available when refis are slow.
  • The benchmark is an opinion of value, not the actual market price the property sold for.

Concerns with testing against sale price benchmarks

Since the benchmark data consists of sale prices from public records and MLS data, any AVM provider with access to the same sales data can easily look up those prices and send back an “AVM” value that is very close to the benchmark. The industry should be aware of these testing loopholes: there is money to be made from inflated test results, and it is tempting for an AVM provider to take advantage of the situation.

A note on preventing the gaming of third-party tests: AVM testers are not blind to the fact that some tests can be gamed to make an AVM look better than it actually is. They strongly discourage this practice and take steps to prevent it; however, it is not always easy to detect.

Concerns with testing against purchase appraisal benchmarks

A test against purchase appraisals can also be gamed, since AVM providers can usually see the listing price or pending price for these properties in MLS data, and it is tempting for a provider to tailor results to those prices. An AVM provider that can also view the appraisals completed each week could tailor its results that way as well.

A note on comparing AVMs to appraisals: While there are national, third-party aggregators of appraisals that provide this service, they do not disclose the actual appraised values publicly. Even the AVM provider is blind to the actual data in the appraisal reports from these services. The valuations are only used in aggregate to answer the question that everyone has: “Are AVMs as good as appraisals?” (The answer in 2019: They are not. There are too many factors that appraisers take into account that a machine cannot.)

Conclusion

Choosing the right AVM provider up front is critical. Understanding how an AVM will perform in day-to-day operation, by validating the methodology behind its performance metrics, will save lenders from having to repeat the selection process. An AVM provider should make this information easy to find, enabling a hassle-free switch to a strong-performing AVM.

AVMs can be a useful tool even in situations where they’ve traditionally had difficulty being consistently accurate (e.g. home equity lending). Since most tests do not shed light on how AVMs perform when the benchmark is not known, it is important to ensure the AVM can stand up in those situations. Testing competing AVMs purely against the same set of refinance appraisals gives the best indication of which AVM will perform best in lending situations.

If you’re looking to add AVMs to your lending/securitization workflow, look for my next post: “How to Conduct a Bulletproof AVM Test.”

Addendum — accuracy measurements defined

Most AVM testers produce a few common measurements to gauge accuracy. They don’t all tell the same story, and are typically used in conjunction with each other. The best AVMs find a balance between hit rate and accuracy measures, so they can be tailored to the use case (sometimes really strong accuracy is preferred over hit rate, other times the opposite is desired). A good AVM provider can adjust their model to suit either need.

Hit rate
The share of benchmark addresses for which the AVM was able to return a value, usually reported as a percentage. The hit rate depends on the AVM’s ability to locate the address and on how much confidence the AVM has in its prediction of value for that address. Often, AVMs do not (and should not) produce results when they don’t have strong confidence in the valuation.

P10 (or PPE10)
The percentage of AVM values that fall within 10 percent of their benchmark. It is a standard measure for the AVM industry, but it says nothing about how far off the remaining values are from the benchmark (the spread).

MAE (mean absolute error)
Calculated by taking the percent variance between each AVM value and its benchmark, taking the absolute value of each, and averaging them over the whole test set. It is a good measure when the benchmarks are consistent, since the times the AVM was very wrong pull the MAE up. In other words, it accounts for the outlier predictions.

MdAE (median absolute error)
Similar to MAE, but since it uses the median, it hides the times the AVM was really far off (the outliers). This is the most common measure for consumer-grade (not professional/lending-grade) AVMs.
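To make the four measurements concrete, here is a minimal sketch that computes them for a small test set. It assumes each result is an (AVM value, benchmark value) pair, with `None` standing in for an address the AVM declined to value; that layout, the function name, and the sample figures are illustrative assumptions, and real testers may define the details differently (for example, how exactly a "hit" is counted).

```python
from statistics import mean, median

def avm_accuracy(results):
    """Compute hit rate, PPE10, MAE, and MdAE for one test set.

    results: list of (avm_value, benchmark_value) pairs, where
    avm_value is None when the AVM returned no estimate.
    Assumes at least one address received an estimate.
    """
    # Absolute percent error for every address the AVM valued.
    errors = [
        abs(avm - bench) / bench
        for avm, bench in results
        if avm is not None
    ]
    return {
        "hit_rate": len(errors) / len(results),
        "ppe10": sum(e <= 0.10 for e in errors) / len(errors),
        "mae": mean(errors),
        "mdae": median(errors),
    }

# Made-up figures, for illustration only.
sample = [
    (102_000, 100_000),  # 2% off
    (190_000, 200_000),  # 5% off
    (324_000, 300_000),  # 8% off
    (400_000, 250_000),  # 60% off: an outlier
    (None, 150_000),     # no estimate returned: a miss
]

stats = avm_accuracy(sample)
# hit_rate is 0.8 and ppe10 is 0.75; mae is about 0.19,
# while mdae is about 0.065.
```

In this sample the single 60 percent miss pulls the MAE to roughly 19 percent, while the MdAE stays near 6.5 percent, which is the outlier-hiding effect described above.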
