Product Title Matching For SKU Management With NLP

A quick dive into how you can automate product data matching and SKU management using just product titles with NLP.

Product title matching is the process of matching identical or similar products from different sources based strictly on the title and other headline attributes of the product. As data sources and data variance grow in an organization, it becomes harder to keep product data accurate and to manage new SKUs, especially when you work with many suppliers and vendors. Poor product data in turn causes problems when you evaluate sales data or try to measure how well your marketing efforts are performing.

While this is often done manually, manual matching is extremely time consuming and scales poorly. Old school systems focused on basic product attributes like SKUs and UPC codes, which do not work well with modern unstructured data. These older systems require auxiliary processes to extract attributes, remove duplicates, and clean stop words from the unstructured product data. Even with all the data cleansing and keyword extraction, these systems still struggle with title pairs like this:

GIGABYTE – 15.6″ FHD IPS 144Hz Gaming Laptop – i5-11400H – 16GB – NVIDIA GeForce RTX 3050 512 GB SSD

And

15.6″ Notebook – i5-11400H – 16GB – GeForce RTX 3050 512 GB Black 6494784

To understand word relationships such as “laptop” and “notebook”, and to recognize that “NVIDIA GeForce” and “GeForce” refer to the same component, we’ll need to use natural language processing.
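To make the limitation concrete, here is a minimal sketch of a token-overlap (Jaccard) score applied to the two laptop titles above. It stands in for a simple rule based matcher and is not the system described in this article; the tokenizer regex and function names are assumptions made for illustration.

```python
import re


def tokenize(title: str) -> set[str]:
    """Lowercase a title and split it into alphanumeric tokens.

    Tokens joined by '.' or '-' (for example '15.6' or 'i5-11400h') stay whole.
    """
    return set(re.findall(r"[a-z0-9]+(?:[.-][a-z0-9]+)*", title.lower()))


def jaccard(title_a: str, title_b: str) -> float:
    """Share of distinct tokens that the two titles have in common."""
    tokens_a, tokens_b = tokenize(title_a), tokenize(title_b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


title_a = ("GIGABYTE – 15.6″ FHD IPS 144Hz Gaming Laptop – i5-11400H – 16GB – "
           "NVIDIA GeForce RTX 3050 512 GB SSD")
title_b = "15.6″ Notebook – i5-11400H – 16GB – GeForce RTX 3050 512 GB Black 6494784"

score = jaccard(title_a, title_b)
print(f"token overlap: {score:.2f}")  # 8 shared tokens out of 19 distinct ones
```

Fewer than half of the distinct tokens overlap even though both titles describe the same machine, and “laptop” and “notebook” contribute nothing to the score because the matcher only sees them as different strings.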

What Product Title Matching Can Provide For You

Product data matching based on title provides retailers and ecommerce brands a ton of benefits in the world of sales data and marketing intelligence.

Organize products and SKUs across multiple vendors and suppliers
Use competitor data to understand market trends and competitive pricing
Understand product life cycle
Ensure there are no missing pieces in your sales data and marketing campaigns

Because nearly every product record has a title, a title based matching system means you always have the information you need to perform data matching. Other systems that require a ton of data points or in-depth product descriptions can struggle as you scale into more products. We’ve found that a deep learning based NLP system that focuses on the product title gets similar results without the long term scaling risk. We’ve been able to use product title matching as a baseline and build other models around it, such as UPC matching and product description matching, that enhance the results rather than being something the system relies on.

We’ve built our product title matching software using popular NLP models such as GPT-3, BERT, and SBERT to learn the relationships between the language features of a title and title attributes such as brand name, product name, and type. In our experience these deep learning based models are far superior to fuzzy matching and other rule based approaches, and they scale easily with new data variance and noise.
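The general shape of embedding based matching can be sketched as follows: encode each title as a vector, then compare the vectors with cosine similarity. This is only an illustration under stated assumptions. The `toy_embed` function is a hypothetical stand-in that counts character trigrams so the example runs with the standard library alone; in a real system the `embed` argument would be a callable wrapping a trained sentence-embedding model such as SBERT, and the threshold value here is arbitrary.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]
Encoder = Callable[[str], Vector]


def toy_embed(title: str, n: int = 3) -> Vector:
    """Hypothetical stand-in encoder: character n-gram counts.

    A learned model would capture meaning; this only captures spelling.
    """
    text = " ".join(title.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: float(count) for gram, count in grams.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(title_a: str, title_b: str, embed: Encoder = toy_embed) -> float:
    """Encode both titles with the given encoder and compare them."""
    return cosine(embed(title_a), embed(title_b))


def is_match(title_a: str, title_b: str, embed: Encoder = toy_embed,
             threshold: float = 0.5) -> bool:
    """Declare a match when similarity reaches an (assumed) threshold."""
    return similarity(title_a, title_b, embed) >= threshold


garmin_a = "Garmin nuvi 2699LMTHD - GPS navigator - automotive 6.1 in"
garmin_b = "nuvi 2699LMTHD Automobile Portable GPS Navigator"
unrelated = "Horbaach Mastic Gum 1500mg 120 Capsules | Non-GMO & Gluten Free"

print(similarity(garmin_a, garmin_b), similarity(garmin_a, unrelated))
```

Keeping the encoder as a parameter means the comparison logic stays the same when the toy encoder is replaced with a fine-tuned model; only the threshold would need to be re-tuned on labeled pairs.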

Matching between:

Garmin nuvi 2699LMTHD - GPS navigator - automotive 6.1 in

nuvi 2699LMTHD Automobile Portable GPS Navigator

This result from the NLP software shows a few important things:

Stopwords and characters don’t affect our ability to match two product titles
The model can find the words in the title that matter, regardless of their order or any noise words around them.
Brand names are not required for us to find matches or decline a match.
Product attributes such as size or length are not required in every product we’re comparing, and the attributes that are present don’t have to be of the same type.

The product title model picks up on small but important differences between container sizes that are considered different SKUs in the product database. In the second example we see there are a bunch of moving parts – different bottle counts and unstructured data noise but still an easy match.

Refining For Production Use Case

This product title matching software can be fine-tuned on a retail store or ecommerce brand’s actual product data to push the accuracy past off-the-shelf products for your specific use case. This level of customization is available because the product title matcher is built on a language model architecture instead of gimmicky fuzzy matchers or entity extraction models. The ability to fine-tune the architecture on a specific company’s data also allows for better scalability, as it becomes much easier to adjust to changes in unstructured data as you add more products or sources.

Relativity In Product Matching

As you might have noticed, the idea of product matching can be somewhat relative based on what use case you’re trying to cover. If you’re looking to differentiate products based on SKU, you’re going to want different results than if you were trying to understand market size and competitor products.

For instance if you have these two product titles:

Chios Mastiha Pack 60gr (2.11 oz) Small Tears Gum 100% Natural Mastic Gum From Mastic Growers Fresh

Chios Mastiha Pack 25gr (0.88oz) Medium Tears Gum 100% Natural Mastic Gum From Mastic Growers Fresh

You could consider them not a match based on the idea they have two different SKUs inside the same store, but could also consider them a match based on the idea they are both Mastic Gum. If we now included this product title in the mix:

Horbaach Mastic Gum 1500mg 120 Capsules | Non-GMO & Gluten Free

We have to decide beforehand what we are matching for. This is clearly a competitor’s product and has a different UPC code, but it is still Mastic Gum, and if we are just looking for products under the same “umbrella” then this is a match. Lots to think about when designing your product data matching systems.

When you’re using an NLP based product title matching tool this level of flexibility becomes a breeze. We simply fine-tune our architecture for your use case no matter what you consider a “match” and optimize towards that. This level of flexibility is a game changer when looking to use the same architecture for many different use cases inside an organization and still reach high accuracy.

Our SKU based pipeline correctly considers this a no match.
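One way to picture how the definition of a “match” feeds into fine-tuning is as a labeling policy applied to the same title pairs. The sketch below is an assumption about how such training pairs could be prepared, not a description of our pipeline. The `sku_key` and `umbrella` values are hypothetical hand-assigned annotations, and the policy names are made up for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class LabeledProduct:
    title: str
    sku_key: str   # hypothetical hand-assigned SKU identifier
    umbrella: str  # hypothetical hand-assigned product family


def is_match(a: LabeledProduct, b: LabeledProduct, policy: str) -> bool:
    """Decide whether two products match under the chosen policy."""
    if policy == "sku":
        return a.sku_key == b.sku_key
    if policy == "umbrella":
        return a.umbrella == b.umbrella
    raise ValueError(f"unknown policy: {policy}")


def training_pairs(products: List[LabeledProduct],
                   policy: str) -> List[Tuple[str, str, int]]:
    """Build (title_a, title_b, label) rows; label 1 means match."""
    return [(a.title, b.title, int(is_match(a, b, policy)))
            for a, b in combinations(products, 2)]


CATALOG = [
    LabeledProduct("Chios Mastiha Pack 60gr (2.11 oz) Small Tears Gum 100% Natural "
                   "Mastic Gum From Mastic Growers Fresh", "sku-1", "mastic gum"),
    LabeledProduct("Chios Mastiha Pack 25gr (0.88oz) Medium Tears Gum 100% Natural "
                   "Mastic Gum From Mastic Growers Fresh", "sku-2", "mastic gum"),
    LabeledProduct("Horbaach Mastic Gum 1500mg 120 Capsules | Non-GMO & Gluten Free",
                   "sku-3", "mastic gum"),
]

sku_pairs = training_pairs(CATALOG, "sku")
umbrella_pairs = training_pairs(CATALOG, "umbrella")
print([label for _, _, label in sku_pairs])       # every pair is a no match
print([label for _, _, label in umbrella_pairs])  # every pair is a match
```

The titles are identical in both datasets; only the labels change, which is why the same architecture can be optimized toward either definition.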

Product Data Extraction

Once we’ve matched product titles and understand the variance in either our internal sales data or competitor product data, we can use product categorization models or NLP based attribute extraction tools to automatically fill in any data gaps we have, such as product size, manufacturer name, and other product attributes. These pipelines use the same architecture as our product matching so they can be easily integrated.
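As a rough illustration of what attribute extraction produces, the sketch below pulls a few attributes out of the laptop titles used earlier. It uses hand-written regular expressions as a stand-in for the NLP based extraction described above, so the patterns and attribute names are assumptions that only fit this style of title.

```python
import re
from typing import Dict

# Hypothetical patterns written for laptop-style titles only.
ATTRIBUTE_PATTERNS = {
    "screen_inches": r'(\d{1,2}(?:\.\d)?)\s*[″"]',
    "cpu": r"\b(i[3579]-\d{4,5}[A-Za-z]{0,2})\b",
    "gpu": r"\b((?:RTX|GTX)\s*\d{3,4})\b",
    "refresh_hz": r"\b(\d{2,3})\s*Hz\b",
}


def extract_attributes(title: str) -> Dict[str, str]:
    """Return every attribute whose pattern is found in the title."""
    found = {}
    for name, pattern in ATTRIBUTE_PATTERNS.items():
        match = re.search(pattern, title, flags=re.IGNORECASE)
        if match:
            found[name] = match.group(1)
    return found


title_a = ("GIGABYTE – 15.6″ FHD IPS 144Hz Gaming Laptop – i5-11400H – 16GB – "
           "NVIDIA GeForce RTX 3050 512 GB SSD")
title_b = "15.6″ Notebook – i5-11400H – 16GB – GeForce RTX 3050 512 GB Black 6494784"

attrs_a = extract_attributes(title_a)
attrs_b = extract_attributes(title_b)
missing_in_b = sorted(set(attrs_a) - set(attrs_b))
print(attrs_a)
print(attrs_b)
print(missing_in_b)  # attributes the second title does not state
```

The second title never mentions a refresh rate, so that attribute is a gap that a confirmed match with the first title could fill.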

Improve Your Product Taxonomy

Example of generating product categories and tags from our GPT-3 model.

With the product title matching tool you can improve the clarity of your taxonomy by combining the attributes of multiple matching products into a single category. This greatly cleans up and standardizes the attributes that make up your taxonomy system.

GIGABYTE – 15.6″ FHD IPS 144Hz Gaming Laptop – i5-11400H – 16GB – NVIDIA GeForce RTX 3050 512 GB SSD

And

15.6″ Notebook – i5-11400H – 16GB – GeForce RTX 3050 512 GB Black 6494784

Understanding that these are both the same product allows you to fill in any gaps, such as putting “Notebook” and “Laptop” in the same category, recording “NVIDIA” as the GPU manufacturer for both products, and so on. This lets you find miscategorized products and fill in any missing attributes.
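Once two records are known to be the same product, gap filling can be as simple as copying over the fields one record is missing and mapping category synonyms to one canonical name. The sketch below assumes hypothetical record fields and a hand-written synonym table; it is an illustration of the idea, not our implementation.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Hypothetical synonym table mapping raw category names to one canonical name.
CATEGORY_SYNONYMS = {"notebook": "Laptop", "laptop": "Laptop"}


def canonical_category(raw: Optional[str]) -> Optional[str]:
    """Map a raw category to its canonical name, leaving unknown values as-is."""
    if raw is None:
        return None
    return CATEGORY_SYNONYMS.get(raw.strip().lower(), raw)


def fill_gaps(record: Record, matched: Record) -> Record:
    """Copy fields missing from `record` out of its matched counterpart."""
    merged = dict(record)
    for field, value in matched.items():
        if merged.get(field) is None and value is not None:
            merged[field] = value
    merged["category"] = canonical_category(merged.get("category"))
    return merged


# Field values below are assumptions read off the two titles for illustration.
record_a: Record = {"brand": "GIGABYTE", "category": "Laptop",
                    "gpu_maker": "NVIDIA", "color": None}
record_b: Record = {"brand": None, "category": "Notebook",
                    "gpu_maker": None, "color": "Black"}

filled_b = fill_gaps(record_b, record_a)
print(filled_b)
```

Values already present in a record are never overwritten, so a wrong match can add fields but cannot silently replace existing data.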