We often work with customers on this topic and hear many viewpoints, so I'd like to offer some insight from Ariba. I hope some customers will jump in with their own thoughts or processes as well, since I know there is great variability out there.
Accuracy is the toughest SLA to define objectively because there is always some level of subjectivity to it. For example, one person might view a laptop carrying bag as an IT Accessory and another as Luggage. Both are logical, but which is the accurate classification?
Accuracy also raises the question of which is better:
- An accurate but high-level answer (ex. IT)?
- An inaccurate but low-level classification (ex. Mouse Pad)?
The first may be accurate, but it is at far too high a level to source against and is hard for a commodity manager to find and correct. The second is clearly wrong, but at that depth a user is very likely to see the item and be able to correct the classification.
So what is a company to do?
- Definitions. What Ariba suggests is to define accuracy as a correct result at the lowest level of the taxonomy supported by the data. This gives weight to both a good result and a feasible level of classification.
- Correctness. As far as your vendor is concerned, it's hard to expect more than this: a code is correct if a reasonable person would consider it correct. Not quite rock-solid legalese, but this is a tough one. So in the example above, either Luggage or IT Accessory is really correct. That said, it may not be how you want the item classified, which is what matters. Here, you need the ability either to reclassify yourself or to have your vendor do it for you until you are satisfied.
- Granularity. It's also important to include the level at which the invoice line is classified, and we strongly suggest counting a classification as accurate only if it is at the lowest level the data supports. Granularity matters, especially when a line is classified above the level at which you source. You don't want vendors cheating by classifying at high levels, thereby easily classifying a high percentage of spend at high accuracy but not meeting your needs. At the same time, pushing for classification at a specific level makes no sense, as your data may not support it. If an invoice contains no information other than the vendor being Dell, how can anyone classify to L4 of UNSPSC, for example? I always laugh when I hear that some vendor is guaranteeing they will classify 90% of a company's data at L4. That's a neat trick if only 70% of the data supports it. So you want the vendor to have to classify as low as possible, but not be pushed to guess at the next level down and provide useless information.
- Measurement. Once you have agreed on a definition, the next question is how to measure it. There is unfortunately no magic approach here. What we use internally in our own QA review, and what we suggest to customers, is to review random samples of classified invoice (or pCard or other) lines. If your vendor gives you confidence levels for each line with associated accuracy guarantees (as Ariba does), that helps somewhat in verifying vendor SLA achievement and focusing your efforts. The approach we suggest for customers is to:
- Designate a group of commodity experts, usually category managers, to review results.
- For each expert, create a report filtered to the items classified in their commodity (your vendor should do this for you if requested).
- Have each expert review an agreed-upon number of lines (ex. 100 each) and determine whether each is accurate by the definitions above. Ideally, grade each line as accurate or not and, if not, record whether it is simply the wrong category, logical but not the preferred category, or at the wrong level (too low or too high).
- Review with your vendor and, based on the software ability for users to make corrections and update rules/models and the nature of the errors, develop a strategy for jointly improving the results.
- A review of whether overall spend numbers make sense may also be useful, but be careful: the numbers often appear off only because the previously assumed totals were themselves off, because the dates used (accounting vs. invoice vs. other) differ, or because not all spend has been loaded yet (ex. omitting pCard data).
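To make the review steps above concrete, here is a rough Python sketch of how the grading could be tallied. It is only an illustration, not a description of Ariba's tooling: the function names, the grade labels, the category names and the idea of representing a taxonomy path as a tuple (top level first) are all assumptions made for the example. The reviewer supplies the path they consider right at the lowest level the data supports, plus any alternatives they consider logical but not preferred.

```python
import random
from collections import Counter

def sample_lines(lines, n, seed=None):
    """Draw up to n classified lines at random for one reviewer."""
    items = list(lines)
    rng = random.Random(seed)
    return rng.sample(items, min(n, len(items)))

def grade(assigned, reference, acceptable=()):
    """Grade one line (hypothetical labels).

    assigned   -- path the vendor assigned, e.g. ("IT", "Accessories")
    reference  -- reviewer's preferred path at the lowest level the data supports
    acceptable -- other paths the reviewer finds logical but not preferred
    """
    if assigned == reference:
        return "accurate"
    if assigned == reference[:len(assigned)]:
        return "too_high"   # right branch, but stopped above the supported level
    if reference == assigned[:len(reference)]:
        return "too_low"    # guessed deeper than the data supports
    if assigned in acceptable:
        return "logical_not_preferred"
    return "wrong_category"

def summarize(grades):
    """Share of reviewed lines per grade (empty dict if nothing was reviewed)."""
    counts = Counter(grades)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

# Made-up review results: (assigned, reference, acceptable alternatives)
reviewed = [
    (("IT", "Accessories"), ("IT", "Accessories"), ()),
    (("IT",), ("IT", "Accessories"), ()),
    (("Luggage", "Bags"), ("IT", "Accessories"), [("Luggage", "Bags")]),
    (("Office", "Paper"), ("IT", "Accessories"), ()),
]
grades = [grade(a, r, ok) for a, r, ok in reviewed]
print(summarize(grades))
```

Keeping "too_high" and "too_low" as separate grades is what lets you see whether a vendor is playing it safe at high levels or guessing below what the data supports, rather than lumping both into a single "inaccurate" bucket.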
As I mentioned earlier, please chime in with your own approach and tips. Thanks.
Great reply, Alex, thanks for sharing this.
I completely agree with your remarks.
Classification is very much about philosophy and perception. My favorite example is a Blackberry - is it:
- Personal Digital assistant PDAs or organizers or
- Electronic mail service provider or
- cell phone?
Regarding the case where the vendor is Dell and there is no PO / INVOIC information: you could potentially classify to a more granular level if you knew who (person/department) was buying. If Peter were the responsible buyer for mainframes, that would be an additional hint for classification.
What do you guys think of measuring classification accuracy by comparing classified eCatalogs against the classified output received?
It would be the perfect way to gauge accuracy, BUT ONLY IF
- suppliers classified their products correctly, and
- suppliers classified their products granularly.
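To illustrate what such a comparison might look like, here is a small Python sketch. It assumes (my assumption, for the example only) that spend lines and catalog items can be matched on a part number; the function name, field layout and codes are made up. It reports how much of the spend the catalog covers at all, and how often the classified output agrees with the supplier's code where a match exists.

```python
def catalog_agreement(spend_lines, catalog):
    """Compare classified spend lines against supplier catalog codes.

    spend_lines -- iterable of (part_number, assigned_code); part_number may be None
    catalog     -- dict mapping part_number -> code the supplier assigned
    Returns (coverage, agreement): share of lines with a catalog match, and
    share of matched lines where the two codes are identical.
    """
    total = matched = agree = 0
    for part, code in spend_lines:
        total += 1
        if part is not None and part in catalog:
            matched += 1
            if catalog[part] == code:
                agree += 1
    coverage = matched / total if total else 0.0
    agreement = agree / matched if matched else 0.0
    return coverage, agreement

# Made-up data: two lines match the catalog, one has no part number,
# one has a part number the catalog does not contain.
lines = [
    ("A1", "IT.Laptops"),
    ("B2", "IT.Laptops"),
    (None, "Services.Consulting"),
    ("C3", "Office.Paper"),
]
catalog = {"A1": "IT.Laptops", "B2": "IT.Desktops"}
print(catalog_agreement(lines, catalog))
```

A disagreement here only says that the two sources differ; it does not say which one is wrong, which is exactly why the two conditions above matter.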
Looking forward to additional thoughts on this.
The Blackberry example is a very good one to illustrate the subjective nature of many classifications.
The concept of eCatalogs being used to measure accuracy is an interesting one and we are always open to anything that would make this measurement easier. As a vendor, we would be just as happy as most customers, if not more so, to have an objective way to measure accuracy. Would be nice to be able to "prove" we did a good job and remove the subjective nature. We have not had this approach suggested in the past. Theoretically, it could work. I am skeptical whether it would in practice though, for a few reasons:
1-Catalog Quality: As you mention, it would require suppliers to classify their products accurately and at a granular level. It would also require consistency, which, from my experience with catalogs, might be the bigger issue. Different suppliers classify the same item differently, and sometimes even a single supplier's classification changes over time.
2-Completeness: I don't see how this could cover more than a subset of spend, especially on the indirect side. For one thing, we would need part numbers to ensure a 1:1 catalog match, which often don't exist in spend data. And some purchases such as services, which for many companies comprise the majority of spend, would not fit this model. Then you face the issue that you would need many catalogs from different vendors to cover all catalog spend, and have to resolve discrepancies where they overlap. So we would need to have a parallel set of accuracy definitions for spend that can't be compared using catalogs. If this problem didn't exist, customers could just classify their spend on their own by doing a vlookup in Excel or Access by part against spend data.
3-Customer Alignment: Ultimately, we strive to have spend classified how our customers want it classified. While we gather information upfront to modify our existing models with customer-specific input, some of this inevitably entails customer feedback after we complete initial enrichment (having the ability for users to do this online and audit/control the feedback is critical here, so ensure you can do this with your provider's solution). My concern regarding catalogs is that in many cases the customer will want items classified differently than the catalog. If the customer had to review catalogs before they were used, it would add an additional and ongoing effort. We would still need a tiered accuracy measurement where a classification is accurate if it matches customer feedback pertaining to the item, if provided. If no relevant customer feedback is provided, the catalog would apply. And then something different would be needed for non-catalog items or those without part info to ensure a 1:1 match against a catalog.
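The tiered measurement described in the third point could look roughly like the Python sketch below. This is purely illustrative and not something we have built: the function name, the "item_key" and "part_number" fields, the tier labels and the codes are all hypothetical. It picks the reference that a line would be graded against, and leaves anything without feedback or a catalog match to be handled by whatever non-catalog definition has been agreed.

```python
def reference_code(line, feedback, catalog):
    """Pick the reference a classified line should be graded against.

    line     -- dict with an "item_key" and an optional "part_number"
    feedback -- dict item_key -> code the customer asked for
    catalog  -- dict part_number -> code from the supplier catalog
    Returns (tier, code); code is None when neither source applies.
    """
    item = line["item_key"]
    if item in feedback:
        # Tier 1: customer feedback always wins.
        return "customer_feedback", feedback[item]
    part = line.get("part_number")
    if part is not None and part in catalog:
        # Tier 2: fall back to the catalog on a 1:1 part match.
        return "catalog", catalog[part]
    # Tier 3: non-catalog item or no part info; needs a separate definition.
    return "manual_review", None

# Made-up data for the three tiers.
feedback = {"laptop bag": "Luggage.Bags"}
catalog = {"P100": "IT.Accessories", "P200": "IT.Monitors"}

print(reference_code({"item_key": "laptop bag", "part_number": "P100"}, feedback, catalog))
print(reference_code({"item_key": "24in monitor", "part_number": "P200"}, feedback, catalog))
print(reference_code({"item_key": "consulting services"}, feedback, catalog))
```

Note that the first line has both feedback and a catalog entry, and the two disagree; the ordering of the checks is what resolves that in the customer's favor.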
All in all, I am skeptical about how well this would work. If anyone reading this has experience using catalogs in this manner, either successfully or not, please chime in.