My thoughts on Applied ML Days 2024
AMLD Thoughts
Applied Machine Learning Days (AMLD) 2024 is over for me! I spent two full days in Lausanne, where the conference took place on the EPFL campus. Here is a summary of my takeaways!
AI and Search Systems
I was proud to organize the AI and Search Systems track. In past editions of the conference, I noticed a gap: while we all use search systems multiple times a day, there was no dedicated track on them. I was very happy when my proposal was accepted! For its first iteration, I preferred a short, 90-minute track, as it is easier to plan and can serve as a pilot for the future. As the venue is applied, I decided to dedicate the track’s three slots mainly to lessons learned from putting search systems into production at different companies.
The session kicked off with a presentation from Stephane Clinchant titled “Towards Effective and Efficient Sparse Neural Information.” Stephane, who presented his team’s work on Splade (millions of downloads on Hugging Face), drew a parallel with Lethe (which means forgetting in Greek), the river in Greek mythology whose water made those who drank it forget their past. He argued that the excitement over dense text representations made the community “forget” the effectiveness of sparse representations like BM25, which are the powerhouse of most search systems in production. This is why his team created Splade: it uses language models like BERT to create sparse text representations whose properties are inspired by BM25-type models and are enforced by different loss functions during optimization. At the time of its publication, Splade performed excellently, topping the BEIR leaderboard; most importantly, Splade generalized really well to unseen domains and re-ignited interest in sparse representations.
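For readers less familiar with the sparse baseline mentioned above, here is a minimal sketch of classic Okapi BM25 scoring. This is not Splade and not code from the talk; the toy corpus, the whitespace tokenization, and the `k1` and `b` values are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for illustration).
docs = [
    "sparse retrieval with bm25".split(),
    "dense embeddings for semantic search".split(),
    "hybrid search combines sparse and dense retrieval".split(),
]
scores = bm25_scores("sparse retrieval".split(), docs)
```

Only exact term matches contribute to the score, which is why such representations are cheap to serve from an inverted index; the second document shares no term with the query and scores zero.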
My view: I really agree with Stephane: sparse representations should be the v1 of any search system. Once they are in place and properly tuned, teams should look for upside by combining sparse with dense representations in a hybrid mode. In this CIKM paper, “Comparative Analysis of Open Source and Commercial Embedding Models for Question Answering,” I showed that sparse representations are really competitive in diverse settings (e.g., when word distributions do not match the training dataset distributions). They are also fast, cheap, and extensible, and there are many dedicated tools to serve them. That said, dense representations cover a different need (semantic similarity) and deserve to be in the production solution!
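One simple way to combine sparse and dense results in a hybrid mode is reciprocal rank fusion (RRF). The sketch below is my own illustration, not something presented in the talk; the document ids are hypothetical and `k=60` is just a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a sparse (e.g. BM25) and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

RRF only needs ranks, so the sparse and dense scores do not have to be calibrated against each other, which makes it a convenient first hybrid baseline.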
The next presentation was from Eleftherios Spyromitros from the Expedia group, titled “Revolutionizing Question Answering on Travel Data Applications with Retrieval Augmented Generation, From Prototype to Production,” on the main challenges his team at Expedia faced when productionizing a RAG system. Anecdotally, the hackathon project was up and running after a few days, but it took almost a year to reach the pilot stage. Lefteris presented three families of challenges:
Guardrails: We cannot put an LLM-powered system into production without ensuring that it cannot cause harm to users and, by extension, to the company’s image. Google is an unfortunate example of this. So the team built several guardrail modules into the RAG pipeline, ranging from RegEx to ML systems, to ensure that the system answers only the important subset of user requests.
Retrieval Optimization: Lefteris agreed with Stephane’s presentation and suggested that the best solution combines both sparse and dense retrieval methods. He also shared anecdotes on why retrieval (the R) really matters in RAG!
Evaluation: It is an uphill battle. Evaluating RAG in a holistic, reliable, and scalable way is not trivial! He shared some anecdotes from the effort and concluded that, to avoid armies of human evaluators, relying on high-quality LLMs (>GPT4) is a good solution. He also shared encouraging results where such automatic evaluation correlated well with human evaluation.
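To make the last point concrete, one way to check whether an LLM judge agrees with human evaluators is a rank correlation over the same set of answers. The sketch below computes Spearman's rho in plain Python; the scores are made up for illustration and are not Expedia's data, and the talk did not specify which correlation measure was used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 1-5 quality scores for the same six answers.
human_scores = [5, 4, 2, 1, 3, 4]
llm_scores = [5, 5, 1, 2, 3, 4]
rho = spearman(human_scores, llm_scores)
```

A value of `rho` close to 1 on a labeled sample suggests the LLM judge orders answers much like the human evaluators do, which is the kind of evidence needed before scaling automatic evaluation.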
My view: Today we can implement a RAG system with langchain in literally a few lines of code, and have a quick Gradio demo to showcase it. From that point to production there is a long way to go, as Lefteris said. I did the same at Salesforce: a demo can serve well to get sponsorship and buy-in from leadership, but a full-fledged product is another story. And in the process there are so many decisions, from chunking to how to combine sparse and dense retrievers to how to enforce business logic, each of which requires careful thought, planning, and dedicated experiments. It is quite reassuring, though, to see that the Expedia team faces the same challenges as those I face at Salesforce!
The last presentation was from Sarah Le Moigne and Claire Helme-Guizon. They came all the way from Paris, where they develop ML ranking models for Algolia. Algolia provides Search As A Service to other companies, and they touched on the challenges of developing ranking solutions that need to work well across a wide range of very diverse customers! Well, just like we do at Salesforce. The question: how do you do this so that (i) it scales (hundreds of thousands of predictions per day), (ii) it is easy to refresh daily, (iii) it is cheap to serve on CPU, and (iv) it is explainable for the user? Not easy, right? They defended their choice to train one model per customer (so thousands of models refreshed daily) and shared several of the challenges they are facing and the steps they take to solve them. And, yes, bias is one of them: everyone who works in search has to deal with at least presentation and selection bias, as well as the click imbalance issue! So many potential solutions and things to experiment with on this…
As an organizer, my feeling is that the session was successful! I am thankful to the speakers who agreed to present and for the energy and time they put into it! As a data scientist, I had defined my success metrics, and they were green! The room was full, people stayed with us even when lunch was served, the questions were interesting and to the point, and people really tried to participate and engage. I am already looking forward to the next iteration!
My view: Well, a per-customer model... What a north star! It is true that the diversity of data across customers makes this a sound choice, but refreshing, serving, and monitoring tens of thousands of models daily is an enormous data science and engineering challenge. I like it! And what can I add about bias? It is everywhere. But I really like simple ideas that can get you far. For example, interleaving results in the SERP is one way to deal with the exploration/exploitation trade-off.
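As an illustration of the interleaving idea, here is a minimal sketch of team-draft interleaving, one well-known variant. It is my own example rather than anything Algolia presented; the two rankers, the document ids, and the fixed seed are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, seed=0):
    """Mix two rankings into one result list and record which ranker
    ("A" or "B") contributed each position, so clicks can be credited."""
    rng = random.Random(seed)
    sources = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, teams = [], []
    while len(interleaved) < length:
        # The team with fewer picks goes first; a coin flip breaks ties.
        if count["A"] != count["B"]:
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for team in order:
            # Each team picks its best document that is not shown yet.
            pick = next((d for d in sources[team] if d not in interleaved), None)
            if pick is not None:
                interleaved.append(pick)
                teams.append(team)
                count[team] += 1
                break
        else:
            break  # Both rankings are exhausted.
    return interleaved, teams


# Hypothetical production ranker (A) and exploratory candidate ranker (B).
production = ["a", "b", "c", "d"]
candidate = ["c", "a", "e", "f"]
serp, teams = team_draft_interleave(production, candidate, length=4)
```

Each position on the result page is attributed to one of the two rankers, so clicks can be credited to the ranker that contributed the clicked result. The user still sees the production ranker's top results while the candidate gets some exposure.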
The Other AMLD Sessions
Perhaps the main reason I enjoy AMLD is its diversity. It brings together ML folks from very diverse applications: pharma and drug design, medicine, aviation, finance, retail, regulators, etc. I decided to attend sessions that are far from my specialty, and I was inspired: people in very diverse settings use similar models and face the same challenges: trusted evaluation, privacy, data quality, data quantity, talent… And the fun fact is that everybody agrees that ML is already transforming these domains and expects further acceleration… An excellent time to be part of it!