Liberate your teams from high-effort, repetitive tasks, and empower them to focus on strategic initiatives and deliver seamless customer experiences that build brand trust and loyalty.
AI is fundamentally transforming the way marketing and product teams operate. Gone are the days when we had to adapt to computers; today, computers interpret and respond to us in ways we once only dreamed of.
In this article, we discuss how to use AI to automate product cataloguing, amplify product discoverability with multimodal language models and semantic search, and personalize product recommendations with embedding models.
A quick note about AI models
OpenAI’s GPT-4 is one of the largest and most advanced large language models. Other language models include Google’s PaLM (which powers its chatbot, Bard), Anthropic’s Claude, Meta’s LLaMA, and Cohere’s Command, among others.
We recommend getting started with OpenAI’s language models based on their performance, ease-of-use, documentation, fine-tuning capabilities, and multimodal features.
For most applications, GPT-3.5 is cheaper and faster than GPT-4.
- Multimodality is currently only available with GPT-4 via the chat interface (API access is expected in late 2023); it's not available with GPT-3.5 or most other language models.
- Fine-tuning is not available with GPT-4 (though it's expected in late 2023); it is available with GPT-3.5 and is relatively inexpensive (costs approximately $2-5).
Setting up your own language models is challenging, especially if you're self-hosting. If you need to customize your own, consider smaller language models (e.g., those with around 7 billion parameters, vs. GPT-3.5's 175 billion) like Mistral 7B by Mistral AI, available on Hugging Face. Meta's Llama 2 is the foundation for many open-source models, and Llama 3, on the horizon, is expected to match GPT-4's capabilities.
Adept’s new open-source multimodal model, Fuyu-8B, is available on Hugging Face. We anticipate Google will release their multimodal AI, Gemini, later this year. Many new models, including those for multimodality are expected in the next year.
Embedding models
Look for open-source, all-in-one models that can handle everything from text preprocessing to generating and managing embedding vectors. This makes it easy to access and use embeddings without manual steps.
We recommend Sentence Transformers by Hugging Face or OpenAI's embedding model. (You can also self-host Sentence Transformers.) Both are easy to set up: each is hosted for you and accessible via API. They're also similar in performance, though Sentence Transformers can be more effective and cheaper than OpenAI's model.
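To make the idea concrete, here is a minimal sketch of comparing products by cosine similarity of their embedding vectors. The vectors below are toy values; in practice you would obtain them from Sentence Transformers or OpenAI's embeddings endpoint (the commented-out call is an assumption — check the library's documentation for your version).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings; real models produce hundreds to thousands
# of dimensions. With sentence-transformers you would instead write:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vectors = model.encode(["Silk cummerbund", "Black bow tie", "Running shoes"])
tuxedo = [0.9, 0.8, 0.1, 0.0]
bow_tie = [0.8, 0.9, 0.2, 0.1]
shoes = [0.1, 0.0, 0.9, 0.8]

# The tuxedo sits much closer to the bow tie than to the shoes.
print(cosine_similarity(tuxedo, bow_tie) > cosine_similarity(tuxedo, shoes))  # True
```

Once every product has a vector, "similar products" becomes a nearest-neighbour lookup rather than a keyword match.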
Smarter e-commerce merchandising
Aggregate, enrich, and catalogue product information at scale to improve product discoverability.
You can add attributes, classify taxonomies, and more.
You can use OpenAI’s UI, create your own wrapper UI, or use off-the-shelf products like Scale (as shown here) to add AI-enabled product cataloguing capability to your e-commerce platform.
Multimodality makes it even easier to generate product descriptions and categorize items. Multimodal language models can create summarization data based on an image, adding new ways to present and recommend relevant products based on semantic understanding and user intent.
GPT-4 creates product descriptions and suggests relevant categories and other information you can use to curate products more efficiently and improve product recommendation algorithms.
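A common pattern is to ask the model for structured JSON you can load straight into your catalogue. The helper names and the output schema below are illustrative assumptions, and the commented-out API call is a sketch — the exact client syntax depends on your openai SDK version.

```python
import json

def build_catalogue_messages(product_name, raw_details):
    """Chat messages asking the model for structured catalogue data.
    The JSON schema requested here is illustrative, not an OpenAI requirement."""
    return [
        {"role": "system",
         "content": "You are a product cataloguer. Reply with JSON only, "
                    "using keys: description, category, attributes."},
        {"role": "user",
         "content": f"Product: {product_name}\nDetails: {raw_details}"},
    ]

def parse_catalogue_reply(reply_text):
    """Parse the model's JSON reply into a dict; raises on malformed output."""
    return json.loads(reply_text)

# In production you would send the messages to the API, e.g.:
#   response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
#   data = parse_catalogue_reply(response.choices[0].message.content)
messages = build_catalogue_messages(
    "Silk Cummerbund", "Adjustable, black, dry clean only")
fake_reply = ('{"description": "Adjustable black silk cummerbund.", '
              '"category": "Formalwear > Accessories", '
              '"attributes": {"material": "Silk"}}')
print(parse_catalogue_reply(fake_reply)["category"])  # Formalwear > Accessories
```

Parsing the reply with `json.loads` (rather than trusting it blindly) also gives you a natural place to catch and retry malformed model output.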
Multimodal queries with GPT-4 are currently limited to OpenAI's chat interface. Many expect multimodal API queries to launch in November at OpenAI DevDay.
Last week, Adept released Fuyu-8B, a small version of their not-yet-released multimodal foundation model, Action Transformer (ACT-1). Google's Gemini is anticipated to be the first multimodal language model to be pre-trained concurrently on text and images. Existing multimodal models either combine text and visual transformers (e.g., DeepMind's Flamingo, which builds on Chinchilla) or augment a text-based language model with visual capability, similar to how you can use a calculator plug-in with ChatGPT.
AI-enabled workflows
You can leverage GPT-4 on your e-commerce platform to validate and process unstructured text or any data file.
For example, rather than asking suppliers to submit a CSV file with the required headers, you can allow them to upload any text or data file and email them if they’re missing any required information.
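A deterministic validation step pairs well with this workflow: once the model (or a parser) has normalized the upload into rows, a simple check tells you exactly which required fields to mention in the follow-up email. The field names below are illustrative assumptions.

```python
import csv
import io

# Illustrative required schema -- substitute your platform's actual fields.
REQUIRED_FIELDS = {"productName", "price", "material"}

def find_missing_fields(rows):
    """Return the required fields that are absent or blank in any row."""
    missing = set()
    for row in rows:
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                missing.add(field)
    return missing

# A supplier upload with one row missing a price:
upload = io.StringIO("productName,price,material\nSilk Cummerbund,,Silk\n")
rows = list(csv.DictReader(upload))
print(sorted(find_missing_fields(rows)))  # ['price']
```

The language model handles the messy part (accepting arbitrary file formats); the deterministic check keeps the "what's missing" report precise and auditable.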
With multimodal AI, you can create new workflows to skip many of the steps associated with processing images and other unstructured data, such as extracting and translating text that appears in images.
Fine-tuning for better product curation
We recommend fine-tuning language models in cases when constraining the model will improve its ability to serve a specific purpose.
For instance, you can train GPT-3.5 to better adhere to brand guidelines, use brand-specific terminology, and more consistently and reliably catalogue and curate products.
Experiment with different instructions, break complex tasks into multiple prompts, and explore other strategies before fine-tuning. With the right prompts, fine-tuning may prove unnecessary; it can also undo some of the benefits of the model's pre-trained, generalized knowledge. (Check out OpenAI's documentation about when to use fine-tuning.)
You can upload a JSONL file of example conversations. We recommend starting with at least 50 examples; we see the best improvements with around 200.
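Each line of the JSONL file is one example conversation showing the model the reply you want. A small sketch of building such a file (the prompts, replies, and file name are illustrative):

```python
import json

def to_jsonl_line(user_text, ideal_reply,
                  system_prompt="You are our brand's product cataloguer."):
    """One fine-tuning example in OpenAI's chat fine-tuning format."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": ideal_reply},
        ]
    })

# Pairs of (input, on-brand ideal response) -- aim for 50-200 of these.
examples = [
    ("Describe: silk cummerbund, black",
     "A refined adjustable cummerbund in pure black silk."),
    ("Describe: wool bow tie, navy",
     "A hand-finished navy bow tie in soft merino wool."),
]

with open("training.jsonl", "w", encoding="utf-8") as f:
    for user_text, ideal in examples:
        f.write(to_jsonl_line(user_text, ideal) + "\n")
```

Keeping the system prompt identical across every example (and reusing it at inference time) is what teaches the model to apply your brand voice consistently.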
Smarter customer experiences
Amazon uses text embeddings for product searches. Essentially, embeddings map products into a shared vector space based on their descriptions, so similar or related products sit close together.
Keyword-based search needs exhaustive data because it can't infer user intent. If you need a tuxedo for a wedding, you'll need to search for a tuxedo, a bow tie, dress shoes, and so on (unless the merchant curated a wedding or tuxedo bundle).
We’ve already discussed how to enrich product data with semantic information. Using AI-generated summarization data for product embeddings enables customers to find relevant outfit pieces for a wedding without having to know what a cummerbund is.
Similarly, language models can understand user preferences based on purchase history and browsing behaviour. You can leverage both user and product embeddings to personalize product recommendations, and continuously evolve your system by monitoring key metrics (e.g., conversion rate).
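A recommendation step like this can be sketched as a nearest-neighbour ranking over embeddings. The vectors below are toy values; in practice the product vectors come from your embedding model, and the user vector might be, for example, an average of recently viewed products (an assumption, not the only design).

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def recommend(user_vector, catalogue, top_k=2):
    """Rank products by similarity between user and product embeddings."""
    ranked = sorted(catalogue,
                    key=lambda name: cosine(user_vector, catalogue[name]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings for three products:
catalogue = {
    "running shoes":   [0.9, 0.1, 0.0],
    "running socks":   [0.8, 0.2, 0.1],
    "silk cummerbund": [0.0, 0.1, 0.9],
}
# A user whose browsing history skews athletic:
user = [0.85, 0.15, 0.05]
print(recommend(user, catalogue))
```

At real catalogue sizes you would swap the linear scan for an approximate nearest-neighbour index, and feed conversion metrics back into how the user vector is built.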
Sanity recently showcased how to use generative AI to infer user intent and make smarter product recommendations. For example, when users add running shoes to their wish list, you might want to show other running shoes that they might like. Alternatively, when users add running shoes to their cart, you'd want to recommend socks or other complementary products.
Streamline workflows
Language models allow teams to interact with data and documents in plain language, making information access and retrieval more intuitive and accessible.
If you use Sanity as your headless CMS, check out Sanity’s new API to add embedding capability out-of-the-box.
Sanity’s new feature enables content authors to find and add relevant content pieces. In this demo, the semantic reference search plugin automatically retrieves product entries based on campaign information the user has already written. Learn more about Sanity’s embeddings index API.
There are other ways to use AI to empower your marketing and product teams. Say you want to restock the most popular designs and sizes of your t-shirt inventory ahead of the fall season. One way to do this quickly is to upload sales data from Shopify to GPT-4 and query: “Tell me which sizes and colours/designs were the most popular, and how many I should order to fulfill anticipated orders over the next 3 months based on last year’s sales data.”
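For a query like that, it helps to know what the model is doing under the hood: the popularity part is simple aggregation. A sketch with made-up sales rows (in practice, exported from Shopify):

```python
from collections import Counter

# Illustrative last-year sales rows: (sku, size, colour, units sold).
sales = [
    ("tee-01", "M", "black", 120),
    ("tee-01", "L", "black", 95),
    ("tee-02", "M", "white", 80),
    ("tee-02", "S", "white", 30),
]

size_totals = Counter()
colour_totals = Counter()
for sku, size, colour, units in sales:
    size_totals[size] += units
    colour_totals[colour] += units

print(size_totals.most_common(1))    # [('M', 200)]
print(colour_totals.most_common(1))  # [('black', 215)]
```

The forecasting half of the question ("how many should I order?") is where the model adds judgment; the counting half is worth verifying deterministically, since language models can make arithmetic mistakes.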
You can also use GPT-4’s Advanced Data Analysis capability (formerly known as Code Interpreter) to analyze and visualize data.
Examples of visualizations supported by GPT-4 (source: What AI can do with a toolbox).
Understanding pricing
AI models are getting smarter and cheaper every year, and with every new release.
You can access GPT-3.5 and GPT-4 via OpenAI's chat interface with ChatGPT Plus, which costs $20/month. There's a cap on how frequently you can query GPT-4 when using ChatGPT, but OpenAI has been raising the limit steadily.
OpenAI bills API access (including Playground) separately on a per token basis.
It would cost approximately $0.13 with GPT-3.5 and $8 with GPT-4 to generate summarization data for 1,000 products like the one below.
// This JSON object has 262 characters.
{
  "productName": "Silk Cummerbund",
  "description": "Adjustable silk cummerbund",
  "price": "$259.99",
  "availableSizes": ["One Size"],
  "colours": ["Black"],
  "material": "Silk",
  "careInstructions": "Dry clean only",
  "rating": 4.3,
  "reviews": 355
}
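The arithmetic behind estimates like this is simple to sketch. The per-token prices below are illustrative assumptions (check OpenAI's pricing page; rates change with each release), and the sketch counts only output tokens, which is why it lands near, but not exactly on, the figures above.

```python
# Illustrative output prices in USD per 1,000 tokens -- verify against
# OpenAI's current pricing page before budgeting.
PRICE_PER_1K_OUTPUT = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.06}

def estimated_cost(model, products, tokens_per_product):
    """Rough output-token cost of generating one summary per product."""
    total_tokens = products * tokens_per_product
    return total_tokens / 1000 * PRICE_PER_1K_OUTPUT[model]

# A ~262-character JSON record is on the order of 100 tokens.
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, round(estimated_cost(model, 1000, 100), 2))
```

Input tokens (your prompt and the raw product data) bill separately at a lower rate, so treat this as a floor rather than a full quote.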
Most of the costs associated with using Sentence Transformers are attributable to computing costs, billed on an hourly basis. You would pay $9 per month for access to Hugging Face's APIs, plus computing costs (we recommend Spaces Hardware or Inference Endpoints), which start at 5 cents per hour for CPUs and 60 cents per hour for GPUs. You can also run the embedding model locally, which lets you update your embeddings with new data rather than rebuilding them from scratch whenever new data is available.
Embedding 1,000 products takes only seconds. A large real estate company with lots of historical data might need billions of tokens, which would take several hours to process with GPUs and days with CPUs.
Creating an embedding model with 1,000 products would cost $5-10 with OpenAI’s embedding model API.
Summary
- We recommend OpenAI’s GPT-3.5 or GPT-4 for most e-commerce applications. For embeddings, we recommend Sentence Transformers by Hugging Face or OpenAI's embedding model.
- Curate thousands of products in minutes for the price of a coffee. Use language models to categorize your products and generate semantic information to build seamless shopping experiences.
- Multimodal language models understand images and are available via OpenAI’s chat interface with GPT-4 (API access is expected in late 2023). Adept’s new open-source multimodal model, Fuyu-8B, is available on Hugging Face.
- Leverage AI-generated insights about your products and customers to build better search and recommendation algorithms.
- Automate high-effort, repetitive tasks, and empower teams to focus on strategic initiatives that build brand trust and loyalty.