<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2024-02-26T10:44:50+00:00</updated><id>/feed.xml</id><title type="html">John Savage’s Blog</title><subtitle>Machine Learning on the Wild Atlantic Way</subtitle><entry><title type="html">Simulations</title><link href="/science/2023/11/12/simulations-at-ecommerce-company.html" rel="alternate" type="text/html" title="Simulations" /><published>2023-11-12T17:26:36+00:00</published><updated>2023-11-12T17:26:36+00:00</updated><id>/science/2023/11/12/simulations-at-ecommerce-company</id><content type="html" xml:base="/science/2023/11/12/simulations-at-ecommerce-company.html"><![CDATA[<p>Simulations are an underappreciated and underused tool in many companies. 
Most companies have multiple complex systems, often interacting with each other, which require constant optimisation and constant monitoring simply to understand.
We are generally afraid to touch these systems because we’re unsure of the impact of changes on them or on downstream systems.
If we can build simulations that can correspond well enough to these systems to answer important questions about the systems, then we can more freely experiment in that sandbox environment and understand our systems more completely.
Successes at large companies are out there (<a href="https://d1.awsstatic.com/events/Summits/reinvent2022/INO105_Supply-chain-and-logistics.pdf">Amazon</a> and <a href="https://doordash.engineering/2022/08/16/4-essential-steps-for-building-a-simulator/">Doordash</a> are two examples).
I want to argue that teams implementing Machine Learning products are exactly the teams with systems complex enough to warrant simulating, and also the teams with the skill sets to build and use simulations.
In this post I want to go through the different components of a simulation, how to ensure your simulation can answer your questions without overengineering, and also …</p>

<h2 id="what-is-a-simulation">What is a simulation</h2>
<blockquote>
  <p>A simulation is the imitation of the operation of a real-world process or system over time</p>
</blockquote>

<p><em><a href="https://dl.acm.org/doi/pdf/10.1145/324138.324142">INTRODUCTION TO SIMULATION, Jerry Banks</a></em></p>

<p>I think of a simulations as trying to reproduce reality in a simpler, modifiable setting. 
A very tangible example of a simulation is the <a href="https://en.wikipedia.org/wiki/Mississippi_River_Basin_Model">Mississippi River Basin model</a>. 
This was built in the 1940s to aid systematic understanding of flood control measures that had been built on the Mississippi river in the previous decade.
These locks, run-off channels and levees could prevent local flooding but might increase flooding in other areas, so it was clear that a big-picture understanding of how to implement these measures was needed.
Various segments of the Mississippi River became operational in the simulation throughout the 1950s, helping to avoid an estimated $65 million of damage in 1952 alone (almost a billion dollars in today’s money).
Tests on individual problems were conducted until 1971, but high costs and the growth of computer modelling meant the facility was put on standby.
 <img src="/assets/images/simulations/MissBasinModel_Color_Aerial_800x538.jpg" alt="img.png" /> <img src="/assets/images/simulations/Mississippiriver-new-01.png" alt="img.png" /></p>

<p>We can use this example to explore the different parts that make up a simulation.</p>
<h3 id="what-makes-up-a-simulation">What makes up a simulation</h3>
<p>It’s useful to separate the components of a simulation into three different parts: <code class="language-plaintext highlighter-rouge">Input</code>, <code class="language-plaintext highlighter-rouge">Mechanism</code>, and <code class="language-plaintext highlighter-rouge">Output</code></p>

<p><img src="/assets/images/simulations/components.png" alt="img.png" /></p>

<p>For example in the Mississippi River Basin Model, the <code class="language-plaintext highlighter-rouge">Input</code> is the amount of water added at upstream points, and the <code class="language-plaintext highlighter-rouge">Output</code> is the water heights at downstream points at various times.
The <code class="language-plaintext highlighter-rouge">Mechanism</code> is a literal recreation of the landscape at reduced scales, and the simulation propagates along using real physics (gravity, hydrodynamics etc).
Many successful simulations take advantage of re-using parts of the real system for their mechanism.
An important part of the mechanism is the set of configuration parameters, which in this case would be the various settings of the dams and levees.
Many different simulations can be run with various <code class="language-plaintext highlighter-rouge">Inputs</code> and <code class="language-plaintext highlighter-rouge">Mechanism</code> config parameters to obtain <code class="language-plaintext highlighter-rouge">Outputs</code>.</p>
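<p>To make the three-part split concrete, here is a minimal sketch in code. The river “mechanism” and its config key are purely illustrative stand-ins, not part of the real model:</p>

```python
# Input -> Mechanism -> Output, with config parameters on the mechanism.
def run_simulation(mechanism, inputs, config):
    return [mechanism(x, config) for x in inputs]

# Toy mechanism: downstream water height as a fraction of upstream inflow,
# controlled by a single (made-up) levee setting.
def river_mechanism(inflow, config):
    return inflow * (1 - config["levee_absorption"])

outputs = run_simulation(river_mechanism, inputs=[10.0, 20.0],
                         config={"levee_absorption": 0.5})
# outputs → [5.0, 10.0]
```

<p>Runs with different inputs or different config values are then just further calls to the same function.</p>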

<p><img src="/assets/images/simulations/components-for.png" alt="img.png" /></p>

<p>Depending on which part of the simulation is generating <em>new</em> data, the simulation is used for different purposes. 
For many practical applications, simulations are used for Prediction; for example, the Mississippi River Basin Model was mostly used to predict what <code class="language-plaintext highlighter-rouge">Output</code>s would arise from various rainfall levels and flood control measure settings.</p>

<p>For more theoretical applications, simulations are used for Explanation. Given <code class="language-plaintext highlighter-rouge">Inputs</code> that produce known <code class="language-plaintext highlighter-rouge">Outputs</code> in reality, if a <code class="language-plaintext highlighter-rouge">Mechanism</code> can reproduce these we can gain confidence that this corresponds to reality.
This can be particularly useful in applications where fine-grained measurements of reality aren’t possible. Imagine that in the 1950s scientists wanted to understand the mechanism of water flow in remote regions of the Mississippi basin.
Before satellites this measurement would have been difficult, but using a simulation these scientists could examine this in fine detail.
In computer simulations, scientists may try to simplify the mechanism to the bare minimum required to reproduce known <code class="language-plaintext highlighter-rouge">Outputs</code> from <code class="language-plaintext highlighter-rouge">Inputs</code>, thus  gaining theoretical understanding of their domain.</p>

<p>A less common application is to use simulations for Retrodiction. An interesting example of this is the theory that a <a href="https://en.wikipedia.org/wiki/Theia_(planet)">planet known as Theia</a> smashed into Earth billions of years ago which resulted in the formation of our Moon.
Many different <code class="language-plaintext highlighter-rouge">Inputs</code> (planet sizes, speeds etc.) were <a href="https://www.youtube.com/watch?v=kRlhlCWplqk">trialed</a> to find ones that match an <code class="language-plaintext highlighter-rouge">Output</code> of an Earth and Moon system like ours (with the <code class="language-plaintext highlighter-rouge">Mechanism</code> being a physics engine).</p>

<h2 id="correspondence">Correspondence</h2>
<blockquote>
  <p>It is the designer’s or user’s intentions that determine what a simulation is a simulation of and what features are to be taken as corresponding with reality.</p>
</blockquote>

<p><em><a href="https://link.springer.com/10.1007/s11229-011-9976-7">How simulations fail</a></em></p>

<p>Regardless of the use of the simulations, there are questions being asked by the users of the simulation. 
Depending on the questions being asked, the simulation will need to be more or less complex.
There will be aspects of reality that are clearly not needed in the simulation (e.g. the phases of the moon don’t need to be considered in a molecular simulation)
and aspects of reality that are clearly necessary in the simulation (e.g. the electromagnetic force is necessary for a molecular simulation).
The craft of building a simulation arises in determining the aspects of reality that aren’t needed in the simulation and can be ignored, 
which generally allows the simulation to run for longer or be calibrated more easily (or simply make it feasible to build at all).
The degree to which your simulation contains the aspects of reality required to answer your question is known as the correspondence of your simulation.</p>

<p>To examine this idea of correspondence, we can look at three different methods of simulating the spread of infectious diseases.</p>

<h4 id="time-series">Time Series</h4>
<p><img src="/assets/images/simulations/timeseries-disease.png" alt="img.png" /></p>

<p>In a time series, we predict the <code class="language-plaintext highlighter-rouge">Output</code>, the incidence of disease in the population,
using a predictive model fit on some training data (historical <code class="language-plaintext highlighter-rouge">Inputs</code> and <code class="language-plaintext highlighter-rouge">Outputs</code>), see <a href="https://www.sciencedirect.com/science/article/pii/S0960077920303441">this study</a> as an example.
While this satisfies the technical definition of a simulation, with <code class="language-plaintext highlighter-rouge">Inputs</code>, (an extremely simple) ‘Mechanism’, and <code class="language-plaintext highlighter-rouge">Outputs</code>, 
and can answer important questions about the future of a disease, it doesn’t feel like the type of thing we usually mean when we talk about simulations.
Let’s look at some other methods before examining why this doesn’t seem like a good simulation.</p>

<h4 id="compartmental-model">Compartmental Model</h4>
<p><img src="/assets/images/simulations/compartment-disease.png" alt="img.png" /></p>

<p>In <a href="https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology">compartmental models</a>, a population of people is modelled,
and each person in the population is assigned to one of three buckets: Susceptible, Infectious, or Recovered. 
At the beginning of the simulation all but a small number of people are in the Susceptible bucket. 
At each timestep of the simulation, there is a probability that a Susceptible person will move to the Infected bucket, defined by the parameter β.
There is also a probability that a person in the Infected bucket will move to Recovered, defined by the parameter γ.</p>

<p>As this simulation moves through time, we will see an exponential increase in the number of people in the Infected bucket, 
until it reaches a peak and then decays towards zero, as can be seen in the red line in the figure above.
This was the infamous curve that we all wanted to flatten at the beginning of the COVID pandemic, 
which can be flattened in the simulation by decreasing the parameter β. 
In fact, you may remember at the beginning of the COVID pandemic that <a href="https://twitter.com/BenjAlvarez1/status/1250563198081740800">we all kept very close watch</a> on R0, 
the basic reproduction number of the disease, and R0 is simply calculated as β / γ.</p>

<p>By varying the parameters β and γ diseases of different infectivities and populations with different behaviours can be modelled. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> 
This allows researchers and public health officials to predict the likely outcome of various treatments or government measures on the spread of the disease.
There are <a href="https://covid19.uclaml.org/model.html">many more extensions</a> that can be made to this compartmental model to better match the details of the disease of interest,
but this one will do for our discussion.</p>
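<p>The β/γ dynamics described above fit in a few lines of code. This is a sketch with illustrative, uncalibrated parameter values:</p>

```python
# A minimal discrete-time SIR simulation of the compartmental model above.
def simulate_sir(population=1000, initial_infected=1, beta=0.3, gamma=0.1, steps=200):
    s, i, r = population - initial_infected, initial_infected, 0
    history = []
    for _ in range(steps):
        new_infections = beta * s * i / population  # Susceptible -> Infected
        new_recoveries = gamma * i                  # Infected -> Recovered
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

history = simulate_sir()
peak_infected = max(i for _, i, _ in history)  # the curve we wanted to flatten
```

<p>With β = 0.3 and γ = 0.1 this gives R0 = 3, so the infection curve rises, peaks, and decays exactly as described above; decreasing β in the call flattens the peak.</p>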

<h4 id="agent-based-model">Agent-based Model</h4>
<p><img src="/assets/images/simulations/agent-disease.png" alt="img.png" />
In <a href="https://en.wikipedia.org/wiki/Agent-based_model#In_epidemiology">agent-based models</a>, a population of people are again modelled,
but this time each person’s locations, movements, activities, and interactions with others etc. are all modeled. 
This results in the same curve of infection we saw in the compartmental model; however, the large increase in complexity means it models reality much more closely.
This complexity allows the answering of much more complex questions. 
In a <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7095311/">2006 paper</a>, an agent-based study looked at the effectiveness of various prevention and containment strategies for an influenza epidemic.
They found that</p>
<ul>
  <li>Border restrictions and/or internal travel restrictions are unlikely to delay spread by more than 2–3 weeks unless more than 99% effective.</li>
  <li>School closure during the peak of a pandemic can reduce peak attack rates by up to 40%, but has little impact on overall attack rates.</li>
  <li>Case isolation or household quarantine could have a significant impact, if feasible.</li>
</ul>

<p>This level of detailed prescription likely helped give confidence to governments around the world to implement stay-at-home orders that would damage their economies.
The complexity of these simulations comes with a hefty price, however: the many parameters make them extremely difficult to calibrate, and the level of detail makes them quite expensive to run.</p>
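<p>To give a flavour of the agent-based approach, here is a deliberately tiny sketch: agents wander a one-dimensional world and may infect whoever they meet. Every parameter here is an illustrative assumption; real models like the 2006 study track households, schools, and travel patterns:</p>

```python
import random

random.seed(0)  # make the run reproducible

class Agent:
    def __init__(self):
        self.position = random.randint(0, 20)  # location on a 21-cell line
        self.infected = False

    def move(self):
        # Random walk, clipped to the world's edges.
        self.position = max(0, min(20, self.position + random.choice([-1, 0, 1])))

agents = [Agent() for _ in range(50)]
agents[0].infected = True  # patient zero

for step in range(100):
    for a in agents:
        a.move()
    # Agents sharing a cell can transmit the disease with probability 0.5.
    for a in agents:
        if a.infected:
            for b in agents:
                if b.position == a.position and random.random() < 0.5:
                    b.infected = True

infected_count = sum(a.infected for a in agents)
```

<p>Even in this toy version, the simulation tracks individual locations and contacts, which is exactly what lets agent-based models answer questions (travel restrictions, school closures) that compartmental models abstract away.</p>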

<h4 id="correspondence-1">Correspondence</h4>

<p>Clearly these three types of simulations have very different levels of correspondence to reality. 
For infectious disease modelling, clearly we need to have some representation of individuals and movement of disease between individuals at the very least.</p>
<ul>
  <li>Time series models have no correspondence to these things that matter, so they don’t let us ask many questions that matter.</li>
  <li>Compartmental models have the bare minimum of correspondence, so we can only ask the bare minimum of questions.</li>
  <li>Agent-based models correspond to many more parts of reality, so we can ask much more complex questions, at the cost of complexity.</li>
</ul>

<p>In tension with this increasing ability to answer important questions about the system of interest, 
is the ability to set up a calibrated simulation and to run the many different realisations of the simulation needed for exploration of the problem space and for statistical significance.</p>

<p>So the design of a simulation comes down to deciding on the questions that need to be answered, what aspects of reality matter to answering those questions.
These aspects of reality clearly must be represented faithfully in the mechanism.</p>

<p>For example in the compartmental model, even though it is an extremely simple simulation, the two aspects of the problem that matter in reality do correspond in the simulation.
What about an aspect of disease spread that we know is important in reality, for example how much people move around and interact with other people?
In the compartmental model, this is baked into the β parameter, so this doesn’t correspond to reality. 
However this is an intentional choice by the designers of the simulation. 
Many important parts of real-life disease spread are abstracted into the β parameter, allowing epidemiologists to study the dynamics of diseases more generally and extremely efficiently.
If we do want to ask questions about people’s movement, then we need to ensure this is represented faithfully in the simulation, as is done in the agent-based model.</p>

<p><img src="/assets/images/simulations/correspondence.png" alt="img.png" /></p>

<p>We can therefore break up the aspects of reality into the parts that matter to answering the question and the parts that don’t.
In the simulation, the parts that matter must correspond to gain insights, otherwise you will get erroneous results.
For the parts of reality that don’t matter, making them correspond will generally lead to wasted resources in simulation setup and runtime.
These parts can either be ignored if they are truly not relevant (e.g. we don’t need to simulate tectonic plates in a disease simulation),
or abstracted into parameters or components of the simulation.
For example, the β parameter in the compartmental model doesn’t correspond to anything concrete in reality, 
but stands in for many different aspects of disease spread from infected to non-infected people (e.g. contact rates, disease infectivity, mask wearing etc.).
It’s important to note that correspondence to reality is subjective, and even the agent-based models have been 
<a href="https://www.jasss.org/23/2/10.html">criticized for simplifying and unrealistic assumptions</a>.</p>

<h3 id="why-use-simulations">Why use simulations</h3>
<ul>
  <li>Make predictions about the future</li>
  <li>Too risky/expensive to do in reality
    <ul>
      <li>Floods of areas</li>
    </ul>
  </li>
  <li>Too slow to do in reality
    <ul>
      <li>Simulate the cosmic web</li>
    </ul>
  </li>
  <li>Need to run too many experiments
    <ul>
      <li>Waymo has driven 1000x more in simulations than in reality</li>
      <li>Reinforcement learning requires more data than reality can provide</li>
    </ul>
  </li>
  <li>Impossible to make measurements of the real system
    <ul>
      <li>Protein simulations allow visualization of the movement of proteins</li>
    </ul>
  </li>
</ul>

<h3 id="conclusion">Conclusion</h3>
<p>You should consider simulating too. If you want examples to learn from, reinforcement learning papers are a good place to look, since the field is so data hungry that it leans heavily on simulation.
Simulation is used in many real-world applications, and we in the data/software field should take note of its potential for our domains.</p>

<p>There are many other aspects of running a simulation that are interesting to discuss which I may add in future parts, 
including calibration of simulations and avoiding “production bias” in simulation results, as well as more on the “SimOps” of running and analysing simulations.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>You can watch some beautiful explanations of the compartmental models <a href="https://www.youtube.com/watch?v=7OLpKqTriio">here</a> (and many other simulations on the channel) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="science" /><summary type="html"><![CDATA[Simulations are an underappreciated and underused tool in many companies. Most companies have multiple complex systems, many times interacting with each other, which require constant optimisation and simply need constant monitoring to understand. We are generally afraid to touch these systems as we’re unsure of the impact of the changes on it or on downstream systems. If we can build simulations that can correspond well enough to these systems to answer important questions about the systems, then we can more freely experiment in that sandbox environment and understand our systems more completely. Successes at large companies are out there (Amazon and Doordash are two examples). I want to argue that teams that are implementing Machine Learning products are the same teams that have complex enough systems to warrant simulating, but are also the same teams with the skill sets to build and use simulations. In this post I want to go through the different components of a simulation, how to ensure your simulation can answer your questions without overengineering, and also …]]></summary></entry><entry><title type="html">Generative AI in Production</title><link href="/2023/09/08/genai-in-production.html" rel="alternate" type="text/html" title="Generative AI in Production" /><published>2023-09-08T17:26:36+00:00</published><updated>2023-09-08T17:26:36+00:00</updated><id>/2023/09/08/genai-in-production</id><content type="html" xml:base="/2023/09/08/genai-in-production.html"><![CDATA[<blockquote>
  <p>People without dirty hands are wrong.</p>
</blockquote>

<p><em>Cult of Done Manifesto</em></p>

<p>I haven’t been lucky enough to build any Generative AI products other than playing around on my own time.
I was keen to learn from those who have built these and deployed them to customers,
so I attended the <a href="https://home.mlops.community/public/events/llms-in-production-conference-2023-04-13">LLMs in Production conference</a> back in April. 
This blog is a summary of what I learned, mostly from the viewpoint of someone working at a company that is well versed in ML
but is feeling pressure to “do something with Generative AI”.</p>

<p>One thing to keep in mind is that there are many biases in the views given on topics like these: pessimistic ones from those fearful of change,
optimistic ones from those selling picks and shovels in a gold rush, 
and overoptimistic ones from the future-gazers that come out in every hype cycle. 
I liked this conference as it seemed to balance these viewpoints well, and it was pretty clear which persona the speakers fell into (and thankfully very few of the overoptimistic ones).</p>

<h1 id="intro">Intro</h1>
<p>LLMs are the most successful of the class of models known as foundation models, trained to predict what the next word in a sentence would be.
The major differentiator of these models is that their architecture allows them to be trained on HUGE amounts of data: basically the entire internet.
In learning how to do this simple task (and particularly with the addition of instruction fine-tuning), 
researchers discovered these models could reason, solve problems, and create new, creative-seeming pieces of work.
These instruction-tuned foundation models were branded with the label Generative AI, most commonly seen with ChatGPT, DALL-E and Copilot.</p>

<p><img src="/assets/images/genai-in-prod/context.png" alt="img.png" /></p>

<p>Putting this in context, we have the broad base of traditional ML that will likely still be required for 
quantitative, low-latency, or interpretable use cases. 
The public will have had the most interaction with copilot-type (i.e. a chat based interface) Generative AI,
however it’s likely that the short-term successes for many non-data-rich companies will come from using these foundation models
for very targeted applications (e.g. summarise this document, what is this user’s need etc.) which, as we’ll see later, can be 
easily tackled by distilled versions.</p>

<p>Beyond copilots, full automation of these models is still being worked on, 
and the long-term future is to develop agents capable of achieving a task, e.g. “go buy me a blue couch for my living room for less than $1000”.</p>

<h1 id="ml-product-development">ML Product Development</h1>
<p>The main benefit of LLMs that kept being brought up at the conference was the ability to reach Version 1 of your product much quicker.</p>

<p><img src="/assets/images/genai-in-prod/lifecycle.png" alt="img.png" /></p>

<p>A traditional ML product has generally required the sourcing or creation of datasets, a long process of model training and evaluation,
plus hosting of the model for inference. There has been a growing number of off-the-shelf solutions, but these have been 
for limited use cases, and so were difficult to use as differentiators for your business.
Due to the zero-shot/few-shot capabilities of LLMs (and the fact they’re hosted behind a relatively inexpensive API),
suddenly we have a model capable of solving a huge variety of tasks with just a bit of prompt tuning (and maybe the addition of some context).
While this version of your product may be too expensive or high-latency to use at scale, it crucially allows you to
evaluate product-market fit i.e. does the customer even want to use this.
Too many times in the past we have gone through the traditional ML product lifecycle only to discover that the customer has 
no interest in the product!</p>

<h3 id="what-does-v1-look-like">What does V1 look like</h3>
<p><img src="/assets/images/genai-in-prod/version-1.png" alt="img_1.png" /></p>

<p>The user’s input (e.g. their query to your system) is wrapped in a prompt and sent to an LLM API (e.g. OpenAI). 
The response can then be parsed before sending the desired result to the user. 
Evaluation of the response (i.e. is this a good or bad response) allows for finding prompts that work best. 
Evaluation can be <a href="https://eugeneyan.com/writing/llm-patterns/#evals-to-measure-performance">quite tricky for complex tasks</a>,
so you can get away with <a href="https://en.wiktionary.org/wiki/LGTM">LGTM</a>@few for V1, but the earlier you implement
evaluation the better your product will be.
What if the model doesn’t know anything about the task? 
For example, the task requires knowledge of data internal to your company or information from after the model was trained
(e.g. the score from today’s football game).
Then you want to somehow retrieve that relevant information and provide it to the model, also known as Retrieval Augmented Generation.</p>
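<p>The V1 loop described above (wrap the user’s input in a prompt, call the API, parse the response) can be sketched as follows. The prompt template and parsing rules are illustrative assumptions, and the LLM call is stubbed out so the shape of the code is clear:</p>

```python
# A hypothetical prompt template; yours would encode your product's task.
PROMPT_TEMPLATE = (
    "You are a support assistant. Answer the user's question concisely.\n"
    "Question: {query}\n"
    "Answer:"
)

def answer(query, llm_call):
    prompt = PROMPT_TEMPLATE.format(query=query)  # wrap input in the prompt
    raw = llm_call(prompt)                        # send to the LLM API
    return raw.strip()                            # parse before returning to the user

# Stubbed LLM for illustration; in production this would be a thin
# wrapper around your provider's API client.
fake_llm = lambda prompt: "  42  "
print(answer("What is 6 x 7?", fake_llm))  # → 42
```

<p>The parsing step is also where response evaluation naturally hooks in, which is what lets you iterate on prompts.</p>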

<h3 id="retrieval-augmented-generation-rag">Retrieval Augmented Generation (RAG)</h3>
<p>It is useful to think of the model as a reasoning engine rather than a knowledge store. 
This shift in thinking can help get the best out of foundation models, 
as, due to their training process, they are likely to respond with any answer at all rather than say “I don’t know”, i.e. hallucinate.
Therefore, we will get more success from asking a question and providing relevant context,
which is even true of human interactions. Compare the question “What is the best bed?” to the question
“What is the best bed out of these 100 options, given that I am a 36-year-old male redecorating my daughter’s room?”
<img src="/assets/images/genai-in-prod/rag.png" alt="img_2.png" />
Providing relevant information allows the model to respond with information not in its training data and 
also helps prevent fabrication of plausible but non-existent information.
The main limitation is the quality of the search system: if you have low recall you will not be providing the 
model with the facts it requires for its reasoning.</p>

<p>Most descriptions of RAG assume using a vector search engine for this step, presumably because practitioners familiar
with LLMs are more comfortable working with embeddings of text than using information retrieval methods. 
However, there’s no good reason to prefer semantic search over keyword search for this application, and generally 
the best choice for V1 is the search system that’s most prevalent at your company 
(in other words don’t spend your <a href="https://boringtechnology.club/">innovation tokens</a> on a new search system
since you likely need to spend them elsewhere in this novel product).</p>
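<p>Following the advice above to start with whatever search you already have, a minimal RAG sketch needs only keyword retrieval and a prompt builder. The scoring and template here are illustrative assumptions, not a recommendation:</p>

```python
# Rank documents by word overlap with the query (a crude keyword search).
def retrieve(query, documents, k=2):
    terms = set(query.lower().split())
    scored = sorted(documents, key=lambda d: -len(terms & set(d.lower().split())))
    return scored[:k]

# Wrap the retrieved context and the question into a single prompt.
def build_rag_prompt(query, documents):
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

docs = [
    "The football game today finished 2-1 to the home side.",
    "Our refund policy allows returns within 30 days.",
]
prompt = build_rag_prompt("What was the score in today's football game?", docs)
```

<p>Swapping <code class="language-plaintext highlighter-rouge">retrieve</code> for your company’s existing search system is the point: the rest of the pipeline doesn’t care how the context was found.</p>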

<h3 id="beyond-v1">Beyond V1</h3>
<p>If you get to the lucky place that V1 is actually used by customers, there are likely to be two major issues to contend with:</p>
<ul>
  <li>Accuracy isn’t sufficient</li>
  <li>Cost/latency isn’t sufficient</li>
</ul>

<p>The main solutions for these problems outlined in this <a href="https://home.mlops.community/public/videos/cost-optimization-and-performance">great panel discussion</a> were:</p>
<ul>
  <li>Accuracy
    <ul>
      <li>Fine tuning</li>
    </ul>
  </li>
  <li>Cost/Latency
    <ul>
      <li>Use in-house model</li>
      <li>Distill/quantize in-house model</li>
    </ul>
  </li>
</ul>

<h3 id="accuracy">Accuracy</h3>
<p>There are a lot of accuracy improvements to be obtained with just the base API model. 
Improvements to RAG context can be obtained with better search systems.
Prompt engineering (e.g. few shot prompting, chain-of-thought prompting) can make large improvements
but changes to the model behind the API can render this work obsolete very quickly.
At a certain point however, fine-tuning is required, particularly in domain specific applications which may not be 
well represented in training data for the API models e.g. legal, finance, health.</p>

<p><img src="/assets/images/genai-in-prod/out-of-the-box.png" alt="img.png" /></p>

<p><a href="https://home.mlops.community/public/videos/solving-the-last-mile-problem-of-foundation-models-with-data-centric-ai">This great talk</a>
highlighted the ability of LLMs to work out-of-the-box for generic prototypes while falling down in more complex domains.
The speaker showed 4 different use-cases in the banking, pharma, ecommerce, and legal domains, with accuracies of 40-60% from 
off-the-shelf models which rose 20-30 percentage points with fine-tuning.</p>

<h4 id="fine-tuning">Fine-tuning</h4>
<ul>
  <li>Freeze most/all weights of the foundation model, fit an additional set of weights in the model</li>
  <li>Can be done at OpenAI/GCP, but in the context of the cost/latency section below it maybe doesn’t make sense
    <ul>
      <li>Relatively cheap to finetune on these platforms so could be a good V1.1</li>
    </ul>
  </li>
  <li>Numerous steps
    <ul>
      <li>Labelling</li>
      <li>Evaluation</li>
      <li>Training</li>
    </ul>
  </li>
  <li>Requires: Labeled examples (data) and evaluation metrics</li>
  <li>Nice to have: OS foundation model, self-hosting</li>
</ul>
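<p>The first bullet, freezing the foundation model’s weights and fitting a small additional set, can be illustrated with a toy model. Here the “foundation model” is a single frozen weight and the “adapter” is one trainable parameter; real methods like LoRA apply the same idea with low-rank matrices inside a transformer:</p>

```python
# Toy adapter-style fine-tuning: only the adapter parameter b is trained.
w = 2.0                                        # frozen pretrained weight
b = 0.0                                        # trainable adapter parameter
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]    # new task: targets follow y = 2x + 1

lr = 0.1
for _ in range(200):
    # Gradient of mean squared error with respect to b only.
    grad = sum(2 * ((w * x + b) - y) for x, y in data) / len(data)
    b -= lr * grad                             # only the adapter moves; w stays frozen

# b converges towards 1.0, adapting the frozen model to the new task.
```
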

<h3 id="cost--latency">Cost &amp; Latency</h3>
<p>Anecdotally, the cost and latency of OpenAI APIs are too high for production applications, 
particularly for high-volume automated tasks or non-chat user facing applications. 
High-volume automated tasks are also the tasks most suitable for training smaller expert models that can be self-hosted.
An example task from this <a href="https://home.mlops.community/public/videos/cost-optimization-and-performance">great panel discussion</a>  was summarising web pages for a web search index.
The OpenAI API allowed them to evaluate whether LLMs were even capable of the task, then a
self-hosted, distilled open-source model allowed them to summarise every web page in their index.</p>

<h4 id="in-house-model-hosting">In-house model hosting</h4>
<ul>
  <li>LLMs are BIG</li>
  <li>Naive self-hosting potentially increases both cost and latency</li>
  <li>Need techniques to reduce the size of model while retaining performance</li>
</ul>

<p>Note that even shrinking these models 1000x will still require engineering prowess to host effectively.
<img src="/assets/images/genai-in-prod/nvidia.png" alt="img_4.png" /></p>

<h4 id="quantisation">Quantisation</h4>
<ul>
  <li>Keep model size, reduce resolution of model weights</li>
  <li>For example, float32 to float16 or float32 to int8</li>
  <li>Recently been extended to work during fine-tuning, reducing 780GB of memory to a more manageable 48GB</li>
</ul>
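<p>The core idea of quantisation can be shown in pure Python: pick a scale so the largest weight maps to 127, then round. Real systems use per-channel scales and calibration data; this sketch only shows the shape of the idea:</p>

```python
# Symmetric int8 quantisation of a weight vector.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # largest weight -> 127
    q = [round(w / scale) for w in weights]       # each value now fits in int8
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.05, 0.33]               # illustrative weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

<p>Each weight now takes 1 byte instead of 4, at the cost of a rounding error bounded by half the scale.</p>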

<h4 id="distillation">Distillation</h4>
<ul>
  <li>Use a ”teacher” model (billions of parameters) to generate training data to train a “student” model (millions of parameters)
    <ul>
      <li><a href="https://blog.research.google/2023/09/distilling-step-by-step-outperforming.html">Good overview from google</a></li>
    </ul>
  </li>
  <li>Note these “small” models are still FAR larger than anything currently hosted at many companies</li>
</ul>
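<p>Distillation in miniature: the “teacher” below is just a stand-in function (in practice a billion-parameter model), and the “student” is a least-squares line fit to the teacher’s outputs:</p>

```python
# Pretend teacher: expensive to run, so we call it once per unlabelled input.
teacher = lambda x: 3.0 * x + 2.0

unlabelled = [0.0, 1.0, 2.0, 3.0, 4.0]
distilled_data = [(x, teacher(x)) for x in unlabelled]   # teacher labels the data

# Student: fit y = a*x + c by ordinary least squares on the teacher's labels.
n = len(distilled_data)
sx = sum(x for x, _ in distilled_data)
sy = sum(y for _, y in distilled_data)
sxx = sum(x * x for x, _ in distilled_data)
sxy = sum(x * y for x, y in distilled_data)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
c = (sy - a * sx) / n

student = lambda x: a * x + c                            # cheap to serve at scale
```

<p>On inputs like the training data the student matches the teacher, which is exactly the trade made when distilling an LLM for a high-volume task like web-page summarisation.</p>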

<h3 id="what-does-v2-and-beyond-look-like">What does V2 and beyond look like</h3>
<p><img src="/assets/images/genai-in-prod/version-2.png" alt="img_5.png" /></p>

<p>Using some or all of these techniques, we can reach a state where the accuracy, cost, and latency/throughput of the system is acceptable for production use cases.
Note that we have re-introduced many of the steps from the ML product lifecycle we were able to leave out in the creation of V1.0,
which highlights the fact that there is no free lunch in bringing ML products to production.
This was a recurring theme of the conference, that current gen LLMs are excellent for generic applications and proof-of-concepts,
however non-generic domains/applications and production workloads are going to require a lot of custom work.
The idea of simply throwing your problems to an LLM API and walking away is heavily marketed by the API providers,
but the current reality doesn’t seem to bear this out.</p>

<p><img src="/assets/images/genai-in-prod/a16z-arch.png" alt="img.png" /></p>

<p>a16z (who are invested in a lot of “pick and shovel” companies, so take it with a pinch of salt) have created 
<a href="https://a16z.com/emerging-architectures-for-llm-applications/">this architecture diagram</a> for LLM use cases. 
The details aren’t important; what is most noteworthy is how little of the diagram is focused on the model,
and how much is taken up by the supporting infrastructure. This is reminiscent of the infamous Figure 1 in the 
<a href="https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">Hidden Technical Debt in ML Systems</a> 
paper which has ML as a tiny (yet central) component in the context of the supporting infrastructure.</p>

<p>What Generative AI has really done is accelerate the trend of ML products moving from 
needing science as the lynchpin to needing engineering as the lynchpin.
If your company has been investing in its MLOps capabilities, none of the above should sound worrying to you;
however, if it hasn’t, your company is at risk of falling further behind.</p>

<h1 id="next-part">Next part</h1>
<p>In part 2, we will examine some of the details of the Engineering, Science, and Data needed for succeeding with GenAI products.</p>
<ul>
  <li>Good Science necessary but not sufficient
    <ul>
      <li>Designing good evaluation</li>
      <li>Training and fine-tuning large models is a craft</li>
    </ul>
  </li>
  <li>Mostly an Engineering problem
    <ul>
      <li>This is a trend in traditional ML as seen in the recent history of MLOps</li>
      <li>Generative models exaggerate this due to their greater capabilities and risks</li>
    </ul>
  </li>
  <li>Data
    <ul>
      <li>Lots to discuss!</li>
    </ul>
  </li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[People without dirty hands are wrong.]]></summary></entry><entry><title type="html">Exercise for longevity</title><link href="/exercise/2023/01/09/exercise-for-longevity.html" rel="alternate" type="text/html" title="Exercise for longevity" /><published>2023-01-09T17:26:36+00:00</published><updated>2023-01-09T17:26:36+00:00</updated><id>/exercise/2023/01/09/exercise-for-longevity</id><content type="html" xml:base="/exercise/2023/01/09/exercise-for-longevity.html"><![CDATA[<p>See <a href="https://exercise-for-longevity.streamlit.app">here</a> to access the full app.</p>

<iframe src="https://exercise-for-longevity.streamlit.app?embed=true" width="800" height="1200">
  <p>Your browser does not support iframes.</p>
</iframe>]]></content><author><name></name></author><category term="exercise" /><summary type="html"><![CDATA[See here to access the full app.]]></summary></entry><entry><title type="html">Divvy vs Chicago Weather</title><link href="/2015/03/15/divvy-vs-chicago-weather.html" rel="alternate" type="text/html" title="Divvy vs Chicago Weather" /><published>2015-03-15T17:26:36+00:00</published><updated>2015-03-15T17:26:36+00:00</updated><id>/2015/03/15/divvy-vs-chicago-weather</id><content type="html" xml:base="/2015/03/15/divvy-vs-chicago-weather.html"><![CDATA[<h2 id="exploring-chicagos-weather">Exploring Chicago’s weather</h2>
<p>Everyone who’s lived in Chicago knows about its weather. We get all four seasons, and not necessarily in the expected order. The “feels like” temperature last year ranged from -40 to 95 degrees Fahrenheit. The Windy City lived up to its name with gusts up to 55 mph, and the rain and snow were ever-present. Throughout all this though, Divvy cyclists powered through, cycling on all but two days of the year, and that’s only because the system was shut down!! Let’s explore how the various forms of weather affected Divvy riders last year.</p>

<hr />

<h2 id="temperature">Temperature</h2>
<h3 id="temperature-and-divvy-trips-for-every-day-in-2014">Temperature and Divvy trips for every day in 2014</h3>

<p>Obviously no-one likes cycling in cold weather, so a good place to see how weather affects Divvyers is the temperature. The graph below shows the average “feels like” temperature in Chicago for every day in 2014. You can see the long, cold period we had in the first few months, then a lovely mild summer with no sweltering hot days, followed by a relatively mild beginning of winter at the end of 2014. On the same graph, I’ve plotted the number of Divvy trips taken each day in 2014 and, as you’d expect, when the weather warms up throughout the year, people use Divvy much more. It’s especially interesting to look at the times in March when the temperature first gets above 40 degrees: Divvy ridership spikes correspondingly, just as it does the first time the temperature spikes above 60 degrees in May. Having seen this overall trend, let’s see how well temperature and ridership are correlated.
<img src="/assets/images/dailyTempTrips.png" alt="image" />
<em>Figure 1: The number of trips taken each day of 2014 tracks very well with the temperature for the day. This is especially true for days when there is a significant increase in temperature. There’s nothing like a sunny day in March to make you want to cycle down the lakeshore!!</em></p>
<h3 id="temperature-vs-number-of-trips">Temperature vs Number of Trips</h3>

<p>In the next figure, I’ve plotted the temperature and number of trips for each day against each other. In case you’re unfamiliar with this type of visualization, each point on the graph is an individual day. The x-position for the point is the temperature for that day and the y-position is the number of Divvy trips taken that day. As you can see, as the temperature increases, the number of trips taken increases rapidly. The overall trend roughly follows (temperature)<sup>2</sup>, i.e. when the temperature doubles, the number of trips taken quadruples. It’s also possible to fit the data to two straight lines, one below freezing and one above freezing, since it’s pretty easy to see that there is an elbow in the graph at around 32 degrees F. Unfortunately (for the purposes of analysis, not for real life!) this summer was particularly mild, so we can’t see what effect really hot weather has on Divvy ridership. I would imagine that the number of trips per day would start to fall off as the temperature got above 100 degrees F, but we can’t see that from the 2014 data. There’s maybe a hint of this effect in the fact that the top 5 days for Divvy ridership all occurred below 75 degrees F, but we’ll need more hot weather data to be sure.
<img src="/assets/images/tripsVtemp.png" alt="image" />
<em>Figure 2: Temperature and number of trips taken correlate very well, following either a (temperature)<sup>2</sup> trend or two linear trends above and below freezing. We don’t see any falloff in the number of trips taken in very warm weather just yet.</em></p>
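<p>For anyone who wants to reproduce this kind of fit, here’s a small sketch on synthetic stand-in data (the real numbers come from the Divvy and forecast.io data linked at the end of the post). It fits both the quadratic trend and the two straight lines either side of freezing:</p>

```python
import numpy as np

# Synthetic stand-in for the daily (temperature, trips) points in Figure 2.
rng = np.random.default_rng(0)
temp = rng.uniform(-10, 90, 300)                  # daily "feels like" temps (F)
trips = 1.5 * temp**2 + rng.normal(0, 400, 300)   # quadratic trend plus noise

# Quadratic fit: trips ~ a*temp^2 + b*temp + c
quad = np.polyfit(temp, trips, deg=2)
pred = np.poly1d(quad)

# Piecewise-linear alternative: separate lines below and above freezing.
below = temp < 32
line_cold = np.polyfit(temp[below], trips[below], deg=1)
line_warm = np.polyfit(temp[~below], trips[~below], deg=1)

# Doubling the temperature should roughly quadruple the predicted trips.
print(pred(80) / pred(40))
```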

<p>Divvy includes in the data about each trip whether the person taking the trip has a Divvy membership or whether they just have a daily Divvy pass. My assumption about these two groups of people is that Divvy members primarily use it for their commutes and non-members are tourists or pleasure-bikers, and the temperature data helps to bear that out. The plot below is the same as Figure 2, but now I have split the data into whether the person taking a trip is a member or non-member. Looking at the non-member data, we can see that if the temperature is below freezing, there are barely any non-members using Divvy, and as the weather warms up there is an explosion in trips taken, up to 12,000 trips per day. Members, however, are much hardier creatures, with thousands of trips per day still being taken in negative temperatures!! There isn’t a much better testament to the toughness of Chicagoans than the fact there were over 400 trips taken on days when the temperature felt like negative 20 degrees Fahrenheit!!! Thinking about the lunatics who were out cycling in that type of temperature made me wonder how long they were suffering out there, and then how long trips taken on Divvy are in general, and whether they get shorter as the temperature drops.
<img src="/assets/images/member_tripsVtemp.png" alt="image" />
<em>Figure 3: Members (those with yearly passes) and non-members (those with daily passes) show very different behaviour. Members tend to use Divvy for commuting and so travel even in very cold weather, whereas non-members use Divvy for pleasure, and there’s nothing pleasurable about cycling in below freezing weather, so they take very few trips below 32 degrees F.</em></p>
<h3 id="temperature-vs-length-of-trips">Temperature vs Length of Trips</h3>

<p>In this figure I’ve plotted the average trip duration vs the temperature. It’s interesting to note that the maximum length of time you can have a Divvy bike out before you are charged extra is 30 minutes, and the average length of trips for every day is less than 30 minutes. People really don’t want to pay extra! More interestingly for this discussion, we can see a clear relationship between temperature and average length of trip. The average length of a trip at 80 degrees F is roughly 5 minutes longer than the average length of a trip at 0 degrees F. We can drill down into the main cause of this by examining the differences between members and non-members in the next figure.
<img src="/assets/images/lengthVtemp.png" alt="image" />
<em>Figure 4: As temperature increases, the average length of a Divvy trip increases.</em></p>

<p>In this figure you can see the large difference between the behavior of members vs non-members. The average length of a member trip is about 10 minutes, which jibes well with the idea that most members use Divvy for the last part of their commute, e.g. from the train to their house. The average length of a non-member trip is much closer to 30 minutes, the maximum allowed time you can take a bike out before extra charges. Clearly non-members want to get the most “value” from their daily pass and are taking much longer trips. In addition, both types of behavior change very little as temperature changes. Members consistently average about 10 minutes per trip in cold weather or warm, with a slight increase of about 2 minutes over the range of temperatures we saw. Non-members show a slightly larger increase in average trip length, but the large increase in average trip length we saw in Figure 4 has more to do with the shift in the types of users at different temperatures we saw in Figure 3, with more members at low temperature, and a 50/50 mix of members and non-members at higher temperatures.
<img src="/assets/images/member_lengthVtemp.png" alt="image" />
<em>Figure 5: Members and non-members show very different behavior. Members take short trips in all weather but non-members take much longer trips and their trips get shorter as the weather cools down.</em></p>
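<p>The member/non-member split above is just a group-by on the user type recorded with each trip. A toy version in plain Python, where the field names are assumed for the example rather than taken from the real Divvy schema:</p>

```python
from statistics import mean

# A handful of made-up trips; the real dataset has millions of rows.
trips = [
    {"usertype": "Member", "duration_min": 9},
    {"usertype": "Member", "duration_min": 12},
    {"usertype": "Non-member", "duration_min": 28},
    {"usertype": "Non-member", "duration_min": 25},
]

# Group trip durations by user type, then average each group.
by_type = {}
for t in trips:
    by_type.setdefault(t["usertype"], []).append(t["duration_min"])

avg = {k: mean(v) for k, v in by_type.items()}
print(avg)  # → {'Member': 10.5, 'Non-member': 26.5}
```

<p>The same pattern, split by temperature bucket as well as user type, produces the points in Figures 3 and 5.</p>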

<hr />

<h2 id="rain-well-actually-clouds">Rain (well, actually clouds!)</h2>

<p>As there was great correlation between temperature and people’s Divvy riding habits, I was excited to see how other types of weather would affect Divvy ridership. Again, no-one likes riding in the rain, so I expected to see far fewer trips on days when it rained. In the figure below, however, we can see there’s no correlation between the probability of rain and Divvy ridership. There tended to be just as many trips on days where there was a 100% probability of rain as there were on days where there was a 0% probability of rain.
<img src="/assets/images/member_tripsVprecip.png" alt="image" />
<em>Figure 6: No correlation is seen between the probability of precipitation and number of Divvy riders. This is somewhat difficult to see in the points alone, so I’ve fit a straight line to the data to show that there is almost no decrease in the number of trips taken per day as the precipitation probability increases.</em></p>

<p>When we look at cloud cover (where 0 = blue skies and 1 = a totally cloudy sky), though, we can see there is a definite decrease in the average number of trips taken per day when the cloud cover increases. This suggests that people use the clouds as their own rain forecast. Similar to temperature, we see that high cloud cover really cuts the non-member trips down, while members are happier to risk getting caught in a shower.
<img src="/assets/images/member_tripsVcloud.png" alt="image" />
<em>Figure 7: There is a strong correlation between cloud cover and number of trips taken, especially for non-members. If the cloud cover is above 75%, there are never more than 2,000 non-members trips, where there are up to 12,000 non-member trips on clearer days.</em></p>

<hr />

<h2 id="the-windy-city">The Windy City</h2>
<p>A discussion about the effects of weather in Chicago would not be complete without talking about the wind! The wind data gives the average daily wind speed in miles per hour, with average speeds up to 25 mph seen in 2014. It should be noted that these daily averages will be lower than the max wind speeds seen. As we can see in the figure, wind speed has very little effect on the number of trips taken when it’s below about 15 mph, with no clear correlation between wind speed and number of trips taken in a day. However, once the average wind speed gets above 15 mph, the number of trips taken per day drops off dramatically. 15 mph corresponds to a moderate to fresh breeze, at which point “dust and loose paper raised, small branches begin to move” according to Wikipedia, so it clearly doesn’t take much of an average wind speed to stop people Divvying. It’s also interesting to note that the drop-off in number of trips taken happens at 15 mph for both members and non-members. For both temperature and rain/cloud cover, we saw that members were willing to take their short 10-minute cycles in worse weather than non-members did. Wind, it seems, is the great equalizer: no-one wants to battle through a windy day on a Divvy.
<img src="/assets/images/member_tripsVwind.png" alt="image" />
<em>Figure 8: There is no correlation between wind speed and number of trips taken when the wind speed is below about 15 mph. Once the average wind speed for the day gets above about 15 mph, the number of trips for the day drops to much lower values.</em></p>

<p>There is one clear outlier on this figure, however: the member point at wind speed = 23 mph, number of trips = 4,000. It got me worried that I had made some mistake in processing my data, as I didn’t think it was reasonable for 4,000 people to be willing to cycle on one of the windiest days Chicago had last year! Once I looked at the day for that data point, I realized what was going on. That data point represents October 31st 2014, Hallowe’en, so clearly people were willing to cycle to their costume parties whatever the weather. I still wanted to make sure I hadn’t made some error in processing the data, so I checked the weather in Chicago for Hallowe’en in 2014 and I came across this video.</p>

<p>Nothing embodies the ridiculous lengths that Divvy riders went to last year better than <a href="http://www.youtube.com/embed/55LextyEgsk">this video</a>. Not only did someone walk up to a Divvy station, see the bikes blowing in the wind like flags and still hop on and cycle off, but thousands of other someones did the exact same thing!! Either Chicagoans DGAF about the weather at Hallowe’en or we had 4,000 extremely committed Wicked Witches of the West.</p>
<h2 id="conclusion">Conclusion</h2>

<p>Overall we’ve seen that the weather has a strong effect on Divvy riding. Non-members show the most sensitivity to the weather, which fits with their usage of Divvy for pleasure. Members tend to be more consistent in their riding patterns, which fits with their usage of Divvy for short commutes. We also saw that temperature has by far the strongest effect on ridership, with strong winds and cloudy skies also having an effect. Finally, we have proof for the claim that Chicagoans don’t care about bad weather, with many trips taken by members on days with negative Fahrenheit temperatures and 50 mph gusts of wind.</p>

<p>If anyone would like to see how I analyzed the Divvy and weather data, I have all my code on <a href="https://github.com/savagej/DivvyWeather2014">github</a> where you can download it and play with it yourself. All weather data is from <a href="http://www.forecast.io">forecast.io</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Exploring Chicago’s weather Everyone who’s lived in Chicago knows about its weather. We get all four seasons, and not necessarily in the expected order. The “feels like” temperature last year ranged from -40 to 95 degrees Fahrenheit. The Windy City lived up to its name with gusts up to 55 mph and the rain and snow was ever present. Throughout all this though, Divvy cyclists powered through, cycling on all but two days of the year, and that’s only because the system was shut down!! Let’s explore how the various forms of weather affected Divvy riders last year.]]></summary></entry></feed>