00:00:07 Differentiable programming as a new term.
00:01:38 The core idea of gradient-based optimization in deep learning.
00:03:48 The breakthrough of automatic differentiation in computing gradients.
00:06:00 Differentiable programming’s origins and its impact on machine learning.
00:07:43 The complexity of seemingly simple tasks and the iterative progress in AI.
00:09:33 Transition from neural networks to deep learning and differentiable programming.
00:11:22 The benefits of differentiable programming in supply chain optimization.
00:13:26 Coupling pricing, demand forecasting, and stock allocation in supply chain management.
00:15:00 Differentiable programming in machine learning and addressing uncertainties.
00:16:00 Differentiable programming in supply chain and comparison to big tech companies.
00:18:19 Applying AI techniques from other fields to supply chain problems.
00:20:15 Benefits of differentiable programming for predictive modeling and optimization in supply chains.
00:22:01 The challenges ahead for differentiable programming in supply chain management.
00:24:16 Closing thoughts.
Summary
Kieran Chandler interviewed Joannes Vermorel, founder of Lokad, about differentiable programming and its significance in AI and supply chain optimization. Differentiable programming emerged from breakthroughs in automatic differentiation techniques and has evolved from neural networks and deep learning. The modularity of differentiable programming allows for more flexible and versatile model assembly, making it useful for supply chain optimization. While there are concerns about reliability, Vermorel highlights the effectiveness of machine learning techniques. Lokad, despite a smaller budget, stays competitive by adapting research from tech giants for supply chain applications. Differentiable programming offers a more expressive solution for crafting numerical recipes to fit business problems, though achieving consistent results without failures remains a challenge.
Extended Summary
Kieran Chandler, the host of the interview, discussed the topic of differentiable programming with Joannes Vermorel, the founder of Lokad, a software company specializing in supply chain optimization. Vermorel provided insight into the origins and significance of differentiable programming as a concept and its relationship with AI and deep learning.
Yann LeCun, the director of AI research at Facebook, recently suggested that the term “deep learning” has outlived its usefulness as a buzzword and proposed “differentiable programming” to reflect newer developments in software. Vermorel commented on the constant evolution of terminology in the AI field, stating that as soon as a solution is found for a problem, it is no longer considered AI and is given a new name, with differentiable programming being the latest example.
Differentiable programming’s origins can be traced to gradient-based optimization, a core concept in deep learning. Gradient-based optimization involves training a model with millions of parameters using an objective function, which assesses the quality of the model’s results. As new data points are observed, the gradient, or multi-dimensional derivative, is computed to guide adjustments to the parameters, gradually improving the model.
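To make those mechanics concrete, here is a minimal sketch in Python, assuming a toy one-parameter model and a squared-error objective (both invented purely for illustration, not Lokad’s actual recipe): each observed data point nudges the parameter slightly along the negative gradient, which is the essence of stochastic gradient descent.

```python
import random

# Toy model: predicted value = w * x, with a single parameter w to learn.
def predict(w, x):
    return w * x

# Objective function: tells us how good or bad the prediction is for one data point.
def squared_error(w, x, y):
    return (predict(w, x) - y) ** 2

# Gradient of the objective with respect to w: d/dw (w*x - y)^2 = 2 * (w*x - y) * x.
def gradient(w, x, y):
    return 2.0 * (predict(w, x) - y) * x

random.seed(0)
true_w = 3.0
data = [(x, true_w * x + random.gauss(0, 0.1))
        for x in (random.random() for _ in range(1000))]

w = 0.0              # initial guess for the parameter
learning_rate = 0.1
for x, y in data:    # one small adjustment per observed data point
    w -= learning_rate * gradient(w, x, y)

print(round(w, 2))   # ends up close to the true value 3.0
```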
Historically, neural networks, which predate deep learning, employed complex techniques such as backpropagation to compute gradients. These techniques were difficult to implement and performed relatively slowly. A breakthrough occurred about a decade ago when researchers began using automatic differentiation, a technique first discovered in the 1960s. Automatic differentiation simplifies the computation of gradients, but it remained largely ignored by the scientific community until its potential was realized more recently.
The discussion revolves around the concept of differentiable programming and its development, as well as its applications in supply chain optimization.
Differentiable programming emerged as a result of breakthroughs in automatic differentiation techniques, which allowed gradients to be computed for any program, not just for standalone mathematical functions. This approach enabled the development of more complex computational networks that could be applied to a wider range of problems. The term “differentiable programming” comes from the idea of computing the derivative of a program.
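To illustrate what “computing the derivative of a program” can look like, here is a small Python sketch of forward-mode automatic differentiation using dual numbers; this is only one of several autodiff strategies (deep learning frameworks typically rely on reverse mode), and the program being differentiated is a made-up example containing a loop and a branch rather than a closed-form function.

```python
class Dual:
    """Forward-mode automatic differentiation: carry a value and its derivative."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule propagates the derivative automatically.
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def program(x):
    # An ordinary program with a loop and a branch, not just a mathematical formula.
    total = Dual(0.0)
    for i in range(1, 4):
        total = total + i * x * x   # accumulates (1 + 2 + 3) * x^2
    if total.val > 10:
        total = total * 2
    return total

out = program(Dual(2.0, 1.0))        # seed the derivative of x with 1
print(out.val, out.der)              # 48.0 48.0 (value and d/dx at x = 2)
```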
The development of differentiable programming has been iterative, building upon earlier concepts in neural networks and deep learning. Progress in the field has been steady over the past 50-60 years, though initially, there was a misconception that certain problems, such as identifying a dog, would be easier to solve than complex calculations like computing a logarithm. In reality, seemingly simple problems like object recognition and maintaining balance proved more challenging, while calculations became relatively easy with modern computing architecture.
The transition from neural networks to deep learning involved discarding biological inspiration to focus on what worked with computer hardware. The next stage, differentiable programming, built upon deep learning’s modularity, enabling the composition, stacking, concatenation, and mixing of machine learning models. This modularity was crucial for supply chain optimization, which involves diverse elements such as prices, products, clients, locations, and containers.
As people started to build deep learning toolkits that resembled programming languages, the idea of differentiable programming emerged as a natural extension. The automatic differentiation techniques made it straightforward to design and engineer the necessary toolkits. In practice, differentiable programming involves combining various models and blocks, much like a Lego-brick approach. However, compared to deep learning, differentiable programming allows for more flexible and versatile model assembly.
Vermorel explains that differentiable programming allows for programmatic expressiveness, enabling users to revisit problems and express their solutions in a more precise and efficient manner. He provides an example of how differentiable programming can be used to optimize pricing, demand forecasting, and stock allocation. These factors are interconnected; modifying the pricing strategy will impact demand, which in turn affects the required production and stock levels.
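A hypothetical toy version of that coupling, sketched in Python: the price and the stock level are the parameters left blank for the optimizer, the linear price-demand curve and the cost figures are invented for illustration, and a finite-difference gradient stands in for the automatic differentiation that a real differentiable-programming runtime would supply.

```python
UNIT_COST, HOLDING_COST, STOCKOUT_PENALTY = 4.0, 0.5, 6.0

def expected_profit(price, stock):
    # Pricing drives demand, and demand drives how much stock is actually needed.
    demand = max(0.0, 100.0 - 5.0 * price)        # toy linear price-demand curve
    sales = min(demand, stock)
    leftover = max(0.0, stock - demand)           # costs money to hold
    lost = max(0.0, demand - stock)               # costs money to miss
    return ((price - UNIT_COST) * sales
            - HOLDING_COST * leftover
            - STOCKOUT_PENALTY * lost)

def grad(f, params, eps=1e-4):
    # Numerical gradient; a differentiable-programming runtime would derive this exactly.
    base = f(*params)
    g = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        g.append((f(*bumped) - base) / eps)
    return g

price, stock = 5.0, 10.0          # initial guesses for the two "blank spaces"
step = 0.01
for _ in range(2000):             # gradient ascent on expected profit
    g = grad(expected_profit, [price, stock])
    price += step * g[0]
    stock += step * g[1]

print(round(price, 1), round(stock, 1))   # lands near the profit-maximizing pair
```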
Differentiable programming allows users to write programs that leave blank spaces for parameters to be optimized. A supply chain scientist can write these programs and utilize the appropriate technology for optimization. Chandler raises concerns about the reliability of solutions produced by differentiable programming since it involves blank spaces and relies on machine learning. Vermorel acknowledges the limitations, but points out that machine learning techniques have already shown their effectiveness, as evidenced by their success in outperforming human players in games like Go and chess.
When asked about the research and development efforts in differentiable programming at Lokad compared to large tech companies like Facebook, Vermorel admits that their budget is significantly smaller. However, he emphasizes that the research conducted by these tech giants is often published, allowing smaller companies like Lokad to study and take inspiration from their work. The key challenge for Lokad is to stay up-to-date with these publications and adapt the research findings to fit a supply chain mindset.
Vermorel points out that the primary focus of large tech companies is on big AI problems, such as computer vision, speech recognition, speech synthesis, and natural language processing. These areas are not directly related to supply chain management, which is where Lokad’s expertise lies. By keeping a close eye on the research produced by these tech giants and re-engineering it for supply chain applications, Lokad aims to stay competitive in the field of differentiable programming for supply chain optimization.
Vermorel emphasizes that many of the insights gained from AI research are not specific to images or speech, but rather more fundamental to learning from data. These insights can be applied to different problems, including supply chain management, where they may even work better.
The main benefit of differentiable programming in supply chain management, according to Vermorel, is its ability to handle unknowns through predictive modeling and optimization without derailing the business. The challenge lies in aligning numerical solutions with specific business drivers while remaining versatile and expressive. Differentiable programming offers a more expressive solution, making it easier to craft numerical recipes that fit the business problem.
Vermorel notes that one of the biggest challenges in applying differentiable programming to supply chain management is to establish a series of constructs and programming building blocks that work well for the industry. While automatic differentiation can differentiate any program, it is crucial to find specific ways of crafting problems that yield not only good results but also steady and reliable ones suitable for production. The goal is to achieve consistent results without catastrophic failures, which is still a challenge ahead.
Full Transcript
Kieran Chandler: Today, we’re going to continue our mini-series by looking at the origins of differentiable programming. So, Joannes, differentiable programming is yet another buzzword in the world of technology. Do we really need another one?
Joannes Vermorel: I guess so. It’s very interesting because as soon as people start to have a solution that works for problems, suddenly it’s not AI anymore. It comes with a different name. AI is the generic term to say it’s terra incognita; we don’t know how to solve those classes of problems. As soon as we have a solution, the solution has a name, and typically it has been a series of relatively iterative breakthroughs with many iterations. Then it comes with a name that reflects what is dominating as part of the numerical recipe in this solution.
Kieran Chandler: Okay, and then we move on to differentiable programming. What’s the story behind that? Where did the name come from and how did we get to that?
Joannes Vermorel: The name comes from one of the ingredients that powered deep learning, which is the idea of gradient-based optimization. What does that mean? A gradient-based optimization means that you have a model with potentially millions of parameters. The way you’re going to train those parameters is by having an objective function, an equation that tells you if your results are good or bad. The idea is that whenever you look at a data point, you have the information that flows back through this function, and you can compute the gradient. The gradient tells you that if you steer the parameters just a little bit in this direction, it will locally improve the objective function just a little bit. That’s the idea at the core of stochastic gradient descent, which is the algorithm used to optimize modern machine learning algorithms and all of deep learning, for example.
So, we have this gradient-based approach with many parameters, and the idea is to move the parameters a little bit every single time you observe a new data point so that you can gradually get better. The problem then becomes how to compute this gradient. By the way, the gradient is just a big name for multi-dimensional derivatives. If you have high school algebra, you look at the derivatives, the slope of a curve in dimension one. If you have many dimensions, you’re going to refer to them as a gradient. Because you have many parameters, you want to compute the slope for every single one of those parameters.
It turns out that historically, neural networks, which probably came before deep learning, had all sorts of very complicated techniques called backpropagation of the gradients. In terms of complexity to implement and performance, it was complicated and kind of slow. Not that it was slow, but slower than what we have today. One of the breakthroughs that unlocked differentiable programming was that people started to realize about ten years ago that they could use a technique called automatic differentiation, which is, by the way, 50 years old. It was first uncovered in the mid-60s, so it’s been quite a while. But it stayed largely ignored by the scientific community. It was rediscovered multiple times, but somehow those discoveries did not gain widespread attention.
The people who rediscovered automatic differentiation and the people working on machine learning were in completely different fields, and so those breakthroughs remained, I would say, largely ignored. So we had the machine learning community that was kind of stuck with backpropagation techniques that were very complicated, very tedious to implement, and suddenly people started to realize that with those automatic differentiation techniques, you could compute the gradient for literally any program, not just any function, any program. And that was completely game-changing in terms of implementation. Suddenly you could come up with a computational network that is arbitrarily complicated, not just stacking more layers but having a completely arbitrary program, and then apply these very same gradient descent techniques. That’s where the term differentiable programming comes from: the idea that you’re going to differentiate, as in compute the derivative of, a program. That’s where the name came from, and it reflects a bit of the ambition of those latest advancements in machine learning, which is to think of super complex architectures for computational networks that can be arbitrary programs. Hence the name differentiable programming.
Kieran Chandler: Okay, that’s a lot to take in. So let’s try to unpick a little bit of it. You said some of these ideas came around in the ’50s and ’60s. It’s not really like the rapid development that we’re seeing in AI and things like that. So there’s actually quite an iterative approach to get to differentiable programming?
Joannes Vermorel: Absolutely, but the reality is that even deep learning before that was super iterative and neural networks before that were super iterative. I mean, the pace of progress has been actually quite rapid for the last 50-60 years. It has been incredibly rapid. And what is intriguing is that, you know, in the early ’60s, people thought that, “Oh, if we can crack multiplication, expansion, or all those hard calculations, then identifying a dog will be super easy. After all, any person on the street can say this is a dog, but it takes a super trained mathematician to compute the logarithm. So obviously, computing a logarithm is way more complicated than identifying whether you have a dog in front of you.”
And the biggest surprise came from the fact that it’s actually the reverse. Those problems that we completely take for granted, such as being able to keep your balance while standing on two feet, are tricky. I mean, if you just stop moving, you just fall; it’s completely dynamic. Standing upright, having a bipedal robot, is like an engineering nightmare. It’s much easier to have things that just work on wheels and that are completely stable by design.
So the sort of super simple problems, like standing upright, or identifying whether what you have in front of you is a dog, or a chicken, or something else entirely, like just a poster with a picture of a dog instead of a real dog, those problems are very difficult. And problems like computing your logarithms are actually super easy with the computing architecture that we have. So it was a big surprise, and it took literally multiple decades to realize how much we had to discover even to start tackling those super fundamental problems.
Hence the fact that we have been talking about AI for decades. The progress has been very real, very steady, but there was so much to discover that it felt, maybe from an external viewpoint, kind of sluggish, just because people set the wrong expectations at the very beginning. But it’s coming, there is literally still a lot of progress being made, and we’ve now realized that we have probably…
Kieran Chandler: And how about differentiable programming? What was the inspiration behind that?
Joannes Vermorel: The key, very interesting insight was a transition from neural networks to deep learning. The idea was to completely give up on all the biological inspiration and to realize that if we wanted to move forward, we had to discard the biological inspiration to focus on what is actually working with computer hardware. One of the key insights that drove deep learning was its modularity. You can build machine learning models in a way that is extremely modular; you can compose them, stack them, concatenate them, and mix them in plenty of ways.
Kieran Chandler: And why is that of prime interest for supply chains?
Joannes Vermorel: It’s because we want to mix things such as prices, products, clients, locations, containers, and all sorts of very diverse objects that need to be put together to resolve supply chain problems. You need to address all this diversity. When you start having models that you can compose in many ways, you end up with a programming language, quite literally. The interesting thing is that people started to build toolkits in deep learning that were closer and closer to actual programming languages. For example, Microsoft released their Computational Network Toolkit, CNTK, which had BrainScript, a domain-specific programming language for deep learning. The next stage was to go full programming.
Kieran Chandler: So it’s kind of like a Lego brick approach, combining different blocks from different places and combining these models in different ways. How does that actually work in practice? How are you implementing this coding behind it?
Joannes Vermorel: Legos were pretty much the archetype of what people had with deep learning, indeed. It was about combining blocks, but in ways that were fairly limited. There’s a spectrum between deep learning and differentiable programming without a clear demarcation. The difference was that people transitioned from Legos, where it was just a matter of assembling parts, to programming, where you can do the same thing but with programmatic expressiveness. For supply chains, that means you can revisit problems and express your solution in a way that is much more succinct and to the point of the problem you’re trying to solve.
Kieran Chandler: Can you give an example?
Joannes Vermorel: Sure, let’s try to jointly optimize pricing, demand forecasting, and stock allocation. When you think about it, all those things are completely coupled. I’m forecasting the demand, but I know that if I tweak my pricing strategy, I will modify the demand. If I modify the demand, it has an impact on how much I should produce and stock because the demand will be different. All these things are completely coupled, and they have dependencies that you can literally write down.
Kieran Chandler: It’s complicated in a way. If I have more demand, I need more stock to fulfill the demand. It’s pretty obvious, and if I put the price at a higher point, then for the demand that I will preserve, I will have a higher margin. There are plenty of things that are like complete, down-to-earth calculations, but the question is, how do you put all those ingredients together in order to do something that is both a prediction and optimization?
Joannes Vermorel: The answer is differentiable programming with specific techniques where you can write those programs that will leave a lot of blank space. That’s going to be all the parameters that you want to optimize, and you will have a supply chain scientist writing that and the proper technology to do the optimization.
Kieran Chandler: So what you’re saying with these blank spaces is that you’re actually writing a program where you don’t actually know all of the answers?
Joannes Vermorel: Yes, that’s correct.
Kieran Chandler: How can you know that and have confidence you’re actually going to get to the right answer if there’s these blank spaces?
Joannes Vermorel: Indeed, it’s similar to the phenomenon in machine learning: you’re learning, so you don’t have any guarantees that you will get good results. That being said, I don’t think that nowadays it’s any different from deep learning and all the previous machine learning techniques. For example, machine learning programs have now out-competed all human players in games like Go and chess. So, it’s not as if we didn’t have clear signs that it’s actually working, even beyond human capacity, for problems that are still fairly narrow, as opposed to identifying where a dog is in a messy urban environment, which is a much more difficult problem.
Kieran Chandler: You mentioned some of the supply chain perspectives for differentiable programming. From a research and development perspective, how close is the stuff that we’re doing here at Lokad compared to the stuff that Facebook and other big tech companies are doing in differentiable programming?
Joannes Vermorel: I believe the super large tech players like Google, Facebook, and Microsoft have a much larger budget for research and development. At Lokad, we are doing our best, but let’s be realistic, I don’t have even 1% of the AI budget of Microsoft or Facebook. That’s a reality for most B2B companies right now. Those markets are still fairly niche, and there is no company in the supply chain that would say we have 2,000 AI researchers. However, the good news is that those giants like Google, Amazon, and Facebook are actually publishing a lot, so most of their research is published. This means that one of the key challenges we face at Lokad is to keep a close look at all those publications and steadily take inspiration from them. We re-engineer what they are doing but from a supply chain mindset because those large teams are working on the big AI problems like computer vision, speech recognition, speech synthesis, and natural language processing, which is not at all what people are trying to solve in the supply chain.
Kieran Chandler: So how could you go from those big AI problems like image recognition and speech recognition to supply chain? How can they be related at all?
Joannes Vermorel: The key insight people have been uncovering over the last couple of decades is the core mechanism of learning and the core mechanism behind efficient, scalable numerical optimization. What is very interesting is that most of the insights they are uncovering are not specific to images. They’re more at the fundamental level of learning from data.
Kieran Chandler: So, I’ve heard about a trick that only works for images, but there are many publications where the trick or the insight uncovered is actually not specific at all. It just happens that in order to validate the insights and experiments, they have to engineer a solution, and this solution is engineered for images. But the trick could be applied to completely different problems. It may not naturally work with the same efficiency on completely different problems, but sometimes it will even work better. Can you elaborate on that?
Joannes Vermorel: Yes, you’re right. Sometimes a technique is discovered that is nice for images but not a game-changer. It is published nonetheless, as it is considered novel and contributes to scientific progress. However, when you apply this technique to a different context, like a supply chain, you can achieve significant leaps forward. So, it goes both ways.
Kieran Chandler: What’s the main benefit, from a Lokad perspective, of differentiable programming? Is it the idea that you can answer all those unknowns that are there in a supply chain, all those blank spaces where you don’t actually know what’s going to happen in the future?
Joannes Vermorel: The big challenge that we face in supply chain is how to have predictive modeling and predictive optimization that doesn’t betray the business. It’s non-trivial because it’s not about getting generic answers; it’s about having a very specific class of decisions that optimize the business in highly specific ways, which align with the business drivers that you define. We don’t try to have AI techniques uncover the business goals; the business goals are established with human-level intelligence to define the strategy and perspective. The problem has a lot of structure, and the biggest challenge is making sure that the numerical recipes you’re crafting align with those business drivers. It’s very difficult – usually, you end up with a round hole and a square peg that just don’t fit. Differentiable programming is one way to address this issue.
It lets you be much more expressive, and this is the key to make sure that you can, you know, frame numerical recipes that are actually fitting the business problem that you have in front of your eyes. If you have access to a programming language, then you are so much more, I would say, expressive and versatile. Suddenly it becomes, in practice, a lot easier to make things fit.

Kieran Chandler: Okay, let’s start wrapping things up. You mentioned at the start that differentiable programming was very much the beginning of a long road with plenty of challenges ahead of it. Where are we on that road, and what are the biggest challenges we’re going to face?
Joannes Vermorel: Yes. The biggest challenge is probably to identify a series of constructs and building blocks, you know, programming building blocks, that really work well for supply chain. So, remember, automatic differentiation lets you differentiate any program, true, but as you were pointing out, it’s not because you start pouring parameters into your program that you can say, “Well, anything will work because I can optimize my parameters.” No, the reality is that it’s not any kind of program that works. Yes, you can differentiate any program that has those parameters, but if you were writing a program at random with parameters in the middle, when you trigger your optimization, your automatic differentiation, the results that you will get will be complete crap. So we need to identify specific ways of crafting those problems that yield results that are not only good but also very steady and reliable. We want to have things that we can move to production, so being better on average is not sufficient. You want something that is not only better but also very reliable, so that you don’t generate, you know, one-time wonders or something that is critically bad. You want results that are very steady, very reliable, that do not, I would say, fail in catastrophic ways once a year. And that’s still probably partly ahead of us.
Kieran Chandler: Brilliant. All right, well, that’s where we’ll leave it today. Thanks for your time, Joannes.
Joannes Vermorel: Thank you.
Kieran Chandler: Thanks very much for tuning in. We’ll be back again next week with our last in this mini-series on differentiable programming. But until then, thanks for watching.