<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blancas.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blancas.io/" rel="alternate" type="text/html" /><updated>2025-02-10T23:43:22+00:00</updated><id>https://blancas.io/feed.xml</id><title type="html">Eduardo Blancas</title><subtitle>Personal website</subtitle><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><entry><title type="html">How do AI agents work, anyway?</title><link href="https://blancas.io/blog/smolagents/" rel="alternate" type="text/html" title="How do AI agents work, anyway?" /><published>2025-02-10T00:00:00+00:00</published><updated>2025-02-10T00:00:00+00:00</updated><id>https://blancas.io/blog/smolagents</id><content type="html" xml:base="https://blancas.io/blog/smolagents/"><![CDATA[<!-- 
At [work](https://ploomber.io/), we've initiated several projects incorporating
AI agents into our operations and product, which sparked my interest in better understanding
the technology. While exploring resources, I noticed a significant gap
between the concepts and the implementations, so I chose to examine one
framework (`smolagents`) in detail to understand how it works. -->

<p>In this post, I’ll provide a brief conceptual introduction to AI agents and analyze
the implementation of the <code class="language-plaintext highlighter-rouge">smolagents</code> library by examining its OpenAI API calls using
<code class="language-plaintext highlighter-rouge">mitmproxy</code> and DuckDB.</p>

<h1 id="what-are-agents">What are agents?</h1>

<p>An AI agent is a program that performs actions through a set of tools. For example,
ChatGPT is an agent that can search the web via a tool. Agents use Large Language Models
(LLMs) to break down tasks into smaller ones (planning), choose which tools to use at
each step, and determine when the task is complete.</p>

<p>Tools are typically functions (like Python functions) that the agent calls to
retrieve results or perform actions (such as writing to a database).</p>
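<p>Concretely, a tool is just a function whose name, docstring, and type hints get exposed to the LLM so it knows when and how to call it. A hypothetical example (the weather data here is a stub, not a real API):</p>

```python
def get_weather(city: str) -> str:
    """Return the current weather for a city.

    The docstring and type hints matter: agent frameworks typically
    include them in the prompt so the LLM knows when and how to call
    the tool.
    """
    # A real tool would query an external API; this is a stub.
    fake_data = {"Paris": "18C, cloudy"}
    return fake_data.get(city, "unknown")

print(get_weather("Paris"))  # 18C, cloudy
```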

<p>The plan is the series of steps that the agent will perform. Not all plans are created
equal: shorter plans and less computationally expensive plans are desirable.</p>
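<p>The plan/act/observe cycle can be sketched in a few lines. This is an illustration, not <code class="language-plaintext highlighter-rouge">smolagents</code>’ actual code; the <code class="language-plaintext highlighter-rouge">llm_stub</code> function stands in for a real LLM call:</p>

```python
def multiply(a, b):
    return a * b

def llm_stub(task, observations):
    # A real agent would call an LLM here; we hardcode a two-step
    # plan for illustration: first multiply, then finish.
    if not observations:
        return {"tool": "multiply", "args": (2, 21)}
    return {"tool": "final_answer", "args": (observations[-1],)}

def run_agent(task):
    observations = []
    while True:
        step = llm_stub(task, observations)
        if step["tool"] == "final_answer":
            # the agent decides the task is complete
            return step["args"][0]
        # otherwise, execute the chosen tool and record the observation
        observations.append(multiply(*step["args"]))

print(run_agent("How much is 2 * 21?"))  # 42
```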

<p>To learn more about AI agents, check out
<a href="https://huyenchip.com/2025/01/07/agents.html#planning">Chip Huyen’s blog post</a>
and the <a href="https://huggingface.co/blog/smolagents">smolagents</a> blog post.</p>

<h1 id="code-agents">Code agents</h1>

<p>Since the LLM decides which tools to run at each step, we need a way to represent
tool calling (aka function calling). Code agents represent their tool calls using
actual code (e.g., Python code), in contrast to other agents which represent tool
calls with JSON. <a href="https://arxiv.org/abs/2402.01030">Research has shown</a> that code-based tool
calling produces more effective agents.</p>
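<p>To make the contrast concrete, here is a minimal sketch (illustrative names only) of the same tool call represented both ways, and how each would be executed:</p>

```python
import json

def multiply(a, b):
    return a * b

# JSON-style tool call: the LLM emits a structured object...
json_call = json.loads('{"name": "multiply", "arguments": {"a": 2, "b": 21}}')

# Code-style tool call: the LLM emits Python source instead.
code_call = "result = multiply(a=2, b=21)"

# Dispatching the JSON call requires a name-to-function lookup:
tools = {"multiply": multiply}
json_result = tools[json_call["name"]](**json_call["arguments"])

# Executing the code call means running it in an interpreter
# (code agents do this in a restricted/sandboxed one):
namespace = {"multiply": multiply}
exec(code_call, namespace)

print(json_result, namespace["result"])  # 42 42
```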

<p>We’ll be using the <a href="https://github.com/huggingface/smolagents">smolagents</a> framework
to understand how agents work with the code agent configuration (though you can
also use the JSON configuration).</p>

<h1 id="setup">Setup</h1>

<p>First, let’s install the required dependencies:</p>

<ul>
  <li><a href="https://github.com/huggingface/smolagents">smolagents</a> for running the agent</li>
  <li><a href="https://mitmproxy.org/">mitmproxy</a> for intercepting OpenAI API requests</li>
  <li><a href="https://duckdb.org/">DuckDB</a> for querying the OpenAI API logs</li>
  <li><a href="https://github.com/Textualize/rich">rich</a> for prettier terminal output</li>
</ul>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install packages (including litellm for OpenAI model support)</span>
pip <span class="nb">install</span> <span class="s1">'smolagents[litellm]'</span> mitmproxy duckdb rich
</code></pre></div></div>

<p>Download the code:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone <span class="nt">--depth</span> 1 https://github.com/edublancas/posts
<span class="nb">cd </span>posts/smolagents/
</code></pre></div></div>

<p>Next, start the reverse proxy to intercept OpenAI requests and log them to a <code class="language-plaintext highlighter-rouge">.jsonl</code> file that we’ll query later with DuckDB:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mitmdump <span class="nt">-s</span> proxy_logger.py <span class="nt">--mode</span> reverse:https://api.openai.com <span class="nt">--listen-port</span> 8080
</code></pre></div></div>
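<p>The <code class="language-plaintext highlighter-rouge">proxy_logger.py</code> script ships with the repo; conceptually, its logging core boils down to appending one JSON object per intercepted request/response pair (the field names here are illustrative, not necessarily what the script records):</p>

```python
import json
import time

def log_exchange(path, request_body, response_body):
    """Append one request/response pair as a line of JSON (.jsonl format)."""
    record = {
        "timestamp": time.time(),
        "request": request_body,
        "response": response_body,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: log one fake exchange
log_exchange("demo_logs.jsonl", {"model": "gpt-4o-mini"}, {"id": "chatcmpl-123"})
```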

<h1 id="basic-example-multiplication-no-tools">Basic example (multiplication, no tools)</h1>

<p>Let’s start with a basic example: asking the model to perform a simple multiplication without providing any tools.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">smolagents</span> <span class="kn">import</span> <span class="n">CodeAgent</span><span class="p">,</span> <span class="n">OpenAIServerModel</span>

<span class="c1"># Initialize the model with our reverse proxy
</span><span class="n">model</span> <span class="o">=</span> <span class="n">OpenAIServerModel</span><span class="p">(</span>
    <span class="n">model_id</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
    <span class="n">api_base</span><span class="o">=</span><span class="s">"http://localhost:8080/v1"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Create an agent with no tools
</span><span class="n">agent</span> <span class="o">=</span> <span class="n">CodeAgent</span><span class="p">(</span><span class="n">tools</span><span class="o">=</span><span class="p">[],</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">add_base_tools</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># Run the agent with a simple multiplication task
</span><span class="n">agent</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="s">"How much is 2 * 21?"</span><span class="p">)</span>
</code></pre></div></div>

<p>After running the code, we can view the execution logs by running <code class="language-plaintext highlighter-rouge">python print.py</code>, which displays all logs from the <code class="language-plaintext highlighter-rouge">.jsonl</code> file.</p>

<h2 id="prompt">Prompt</h2>

<p>Here’s the prompt for the API call. I removed several parts to make it shorter (several
parts are redundant, there to deal with the stubbornness of LLMs), but kept the overall message the
same. You can see the complete logs in the <code class="language-plaintext highlighter-rouge">openai_logs_no_base_tools.jsonl</code> file.</p>

<h3 id="system">System</h3>

<p>The system prompt tells the model what its purpose is and the rules it must abide
by. It essentially tells the LLM that its job is to solve tasks with tools, and
that solving a task involves cycling through three steps: <code class="language-plaintext highlighter-rouge">Thought</code>, <code class="language-plaintext highlighter-rouge">Code</code>, and <code class="language-plaintext highlighter-rouge">Observation</code>:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are an expert assistant who can solve any task using code blobs. You have been
given access to a list of tools: these tools are Python functions.

To solve the task, you must plan forward to proceed in a series of steps, in a cycle
of 'Thought:', 'Code:', and 'Observation:' sequences.

'Thought:' sequence, you should explain your reasoning and the tools that you want to use.

'Code:' sequence, you should write the code in Python.

During each intermediate step, use <span class="sb">`print()`</span> to save important information.
These <span class="sb">`print`</span> outputs will then appear in the 'Observation:' field, which will be
available as input for the next step.

In the end you have to return a final answer using the <span class="sb">`final_answer`</span> tool.

Here are a few examples:
<span class="p">
---
</span>
Task: "What is the result of the following operation: 5 + 3 + 2"

Thought: I will use python code to compute the result of the operation and then return
the final answer using the <span class="sb">`final_answer`</span> tool

Code:
<span class="p">```</span><span class="nl">py
</span><span class="n">result</span> <span class="o">=</span> <span class="mi">5</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">2</span>
<span class="n">final_answer</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="p">```</span>
<span class="gh">&lt;end_code&gt;
---
</span>
[...MORE EXAMPLES HERE]
</code></pre></div></div>

<p>After listing a few more examples, the system prompt includes the available tools (we
only have the <code class="language-plaintext highlighter-rouge">final_answer</code> tool) and the rules it must abide by:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You only have access to these tools:<span class="sb">


</span><span class="p">-</span> final_answer: Provides a final answer to the given problem.
    Takes inputs: {'answer': {'type': 'any', 'description': 'The final answer to the problem'}}
    Returns an output of type: any
[... MORE TOOLS ARE ADDED HERE, IF ANY]<span class="sb">


</span>Here are the rules you should always follow to solve your task:
<span class="p">
1.</span> Always provide a 'Thought:' sequence, and a 'Code:
<span class="p">```</span><span class="nl">py' sequence ending with '```&lt;end_code&gt;' sequence, else you will fail.
</span><span class="sb">[...MORE RULES HERE]
</span></code></pre></div></div>

<h3 id="user">User</h3>

<p>The next message has <code class="language-plaintext highlighter-rouge">{"role": "user"}</code>, and it contains the task to perform:</p>

<blockquote>
  <p>New task:
How much is 2 * 21?</p>
</blockquote>

<h2 id="response">Response</h2>

<p>Remember that one of the rules in the system prompt says:</p>

<blockquote>
  <p>Always provide a ‘Thought:’ sequence, and a ‘Code:’ sequence</p>
</blockquote>

<p>Hence, the model proceeds to return <code class="language-plaintext highlighter-rouge">Thought:</code> and <code class="language-plaintext highlighter-rouge">Code:</code></p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: This is a simple multiplication task. I will multiply 2 by 21 and return the
result using the <span class="sb">`final_answer`</span> tool.

Code:

<span class="p">```</span><span class="nl">py
</span><span class="n">result</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">21</span>
<span class="n">final_answer</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="p">```</span>
</code></pre></div></div>

<p>The agent then runs the Python code, and since the <code class="language-plaintext highlighter-rouge">Code:</code> portion already uses
<code class="language-plaintext highlighter-rouge">final_answer</code>, it knows it has finished.</p>
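<p>One simple way to implement this completion check (a sketch, not necessarily <code class="language-plaintext highlighter-rouge">smolagents</code>’ actual mechanism) is to expose <code class="language-plaintext highlighter-rouge">final_answer</code> as a function that raises a sentinel exception when called:</p>

```python
class FinalAnswer(Exception):
    """Sentinel raised when the LLM-generated code calls final_answer."""
    def __init__(self, value):
        self.value = value

def final_answer(value):
    raise FinalAnswer(value)

def execute_step(code):
    """Run one LLM-generated code snippet; return (done, value_or_none)."""
    namespace = {"final_answer": final_answer}
    try:
        exec(code, namespace)
    except FinalAnswer as fa:
        return True, fa.value
    return False, None

print(execute_step("result = 2 * 21\nfinal_answer(result)"))  # (True, 42)
```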

<h2 id="output">Output</h2>

<p>Here’s the output that <code class="language-plaintext highlighter-rouge">agent.run</code> displays:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╭──────────── New run ────────────╮
│                                 │
│ How much is 2 * 21?             │
│                                 │
╰─ OpenAIServerModel - gpt-4o-mini╯
━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  result = 2 * 21
  final_answer(result)
 ──────────────────────────────────────
Out - Final answer: 42
[Step 0: Duration 1.08 seconds| Input tokens: 1,956 | Output tokens: 52]
</code></pre></div></div>

<h1 id="multi-step-example-fibonacci-series-no-tools">Multi-step example (fibonacci series, no tools)</h1>

<p>The previous task was trivial (no tools required, besides the <code class="language-plaintext highlighter-rouge">final_answer</code> tool). The
true potential of AI agents lies in performing complex tasks by using tools. Let’s look
at another agent whose plan involves two steps.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">smolagents</span> <span class="kn">import</span> <span class="n">CodeAgent</span><span class="p">,</span> <span class="n">OpenAIServerModel</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">OpenAIServerModel</span><span class="p">(</span>
    <span class="n">model_id</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
    <span class="n">api_base</span><span class="o">=</span><span class="s">"http://localhost:8080/v1"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">agent</span> <span class="o">=</span> <span class="n">CodeAgent</span><span class="p">(</span><span class="n">tools</span><span class="o">=</span><span class="p">[],</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">add_base_tools</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="n">agent</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
    <span class="s">"Could you give me the 118th number in the Fibonacci sequence?"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="prompt-1st-step">Prompt (1st step)</h2>

<h3 id="system-1">System</h3>

<p>The system prompt is the same as in the previous example.</p>

<h3 id="user-1">User</h3>

<blockquote>
  <p>New task:
Could you give me the 118th number in the Fibonacci sequence?</p>
</blockquote>

<h2 id="response-1st-step">Response (1st step)</h2>

<p>Unlike the first example, this response doesn’t use <code class="language-plaintext highlighter-rouge">final_answer</code>:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: To find the 118th number in the Fibonacci sequence, I can use a simple
iterative approach to compute Fibonacci numbers until the 118th index. This will be
efficient and easy to implement. I'll create two variables to hold the last two
Fibonacci numbers and iteratively compute the next Fibonacci number until I reach
the 118th one.

Code:
<span class="p">```</span><span class="nl">py
</span><span class="k">def</span> <span class="nf">fibonacci</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">b</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
    <span class="k">return</span> <span class="n">a</span>

<span class="n">fibonacci_118</span> <span class="o">=</span> <span class="n">fibonacci</span><span class="p">(</span><span class="mi">118</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fibonacci_118</span><span class="p">)</span>
<span class="p">```</span>
</code></pre></div></div>

<p>This example highlights one of the key challenges in agent systems: planning efficiency.
While a task can often be solved in multiple ways, some plans are more efficient than
others. Longer plans require more computational steps and time to execute, but current
models don’t necessarily optimize for the shortest possible solution.</p>

<p>In this case, the model chose a two-step approach when a single step would have
sufficed: it could have directly called <code class="language-plaintext highlighter-rouge">final_answer(fibonacci_118)</code> instead of
using <code class="language-plaintext highlighter-rouge">print(fibonacci_118)</code>.</p>

<p>Since the model didn’t use <code class="language-plaintext highlighter-rouge">final_answer</code> to conclude the task, we need to proceed
with an additional step.</p>
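<p>For comparison, a single-step plan would have combined both actions in one snippet. Here it is, runnable outside the agent with a stub standing in for the agent-provided <code class="language-plaintext highlighter-rouge">final_answer</code> tool:</p>

```python
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Stub of the agent-provided tool, so this runs outside the agent:
def final_answer(value):
    print("Final answer:", value)

final_answer(fibonacci(118))  # Final answer: 2046711111473984623691759
```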

<h2 id="prompt-2nd-step">Prompt (2nd step)</h2>

<h3 id="system-2">System</h3>

<p>The system prompt is the same as in the previous example.</p>

<h3 id="user-2">User</h3>

<blockquote>
  <p>New task:
Could you give me the 118th number in the Fibonacci sequence?</p>
</blockquote>

<h3 id="assistant">Assistant</h3>

<p>Unlike the first example where we had a single step, this time we have an assistant message.
This message includes the output from the previous step (<code class="language-plaintext highlighter-rouge">Thought:</code> and <code class="language-plaintext highlighter-rouge">Code:</code>) along
with information about the tools that were called (<code class="language-plaintext highlighter-rouge">Calling tools:</code>).</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: To find the 118th number in the Fibonacci sequence, I can use a simple iterative approach to compute Fibonacci
numbers until the 118th index. This will be efficient and easy to implement. I'll create two variables to hold the last
two Fibonacci numbers and iteratively compute the next Fibonacci number until I reach the 118th one.

Code:
<span class="p">```</span><span class="nl">py
</span><span class="k">def</span> <span class="nf">fibonacci</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">b</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
    <span class="k">return</span> <span class="n">a</span>

<span class="n">fibonacci_118</span> <span class="o">=</span> <span class="n">fibonacci</span><span class="p">(</span><span class="mi">118</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fibonacci_118</span><span class="p">)</span>
<span class="p">```</span>
Calling tools:
[{'id': 'call_1', 'type': 'function', 'function': {'name': 'python_interpreter', 'arguments': 'def fibonacci(n):<span class="se">\n</span>    a,
b = 0, 1<span class="se">\n</span>    for _ in range(n):<span class="se">\n</span>        a, b = b, a + b<span class="se">\n</span>    return a<span class="se">\n\n</span>fibonacci_118 =
fibonacci(118)<span class="se">\n</span>print(fibonacci_118)'}}]
</code></pre></div></div>

<h3 id="user-3">User</h3>

<p>To generate the next user message, the agent runs the Python code and sends the captured output back to the model as an observation:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call id: call_1
Observation:
Execution logs:
2046711111473984623691759
Last output from code snippet:
None
</code></pre></div></div>

<h2 id="response-1">Response</h2>

<p>The response from the second API call shows that the model has already identified the
final output and produces a new code snippet that just uses <code class="language-plaintext highlighter-rouge">final_answer</code>, which
finishes the agent execution.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: The 118th number in the Fibonacci sequence is 2046711111473984623691759. Now I will provide this as the final
answer using the `final_answer` tool.

Code:
```py
final_answer(2046711111473984623691759)
```
</code></pre></div></div>

<h1 id="output-1">Output</h1>

<p>This is the output we see in the terminal.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╭──────────── New run ────────────╮
│                                 │
│ Could you give me the 118th     │
│ number in the Fibonacci         │
│ sequence?                       │
│                                 │
╰─ OpenAIServerModel - gpt-4o-mini╯
━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  def fibonacci(n):
      a, b = 0, 1
      for _ in range(n):
          a, b = b, a + b
      return a

  fibonacci_118 = fibonacci(118)
  print(fibonacci_118)
 ──────────────────────────────────────
Execution logs:
2046711111473984623691759

Out: None
[Step 0: Duration 2.70 seconds| Input tokens: 1,961 | Output tokens: 127]
━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  final_answer(2046711111473984623691759)
 ──────────────────────────────────────
Out - Final answer: 2046711111473984623691759
[Step 1: Duration 1.55 seconds| Input tokens: 4,179 | Output tokens: 186]
</code></pre></div></div>

<h1 id="final-thoughts">Final thoughts</h1>

<p>The concept of AI agents is rapidly evolving. Encouragingly, a consensus is emerging
around their core concepts: agents plan and use tools to accomplish tasks.
However, research in this field is still a work in progress. As new research emerges
and more powerful models are developed, many existing frameworks will likely become
outdated. This is why I believe it’s crucial to understand what’s happening behind the
scenes; specifically, what API calls are being made to the LLM. This understanding
allows us to grasp their strengths and weaknesses, customize their behavior, or
even develop our own solutions when existing options don’t meet our needs.</p>

<p>A significant limitation of current frameworks is their reliance on hardcoded prompts,
as there’s no guarantee these prompts will perform optimally for specific tasks.
I predict that future agent frameworks will evolve into meta-frameworks, offering
greater flexibility to customize prompts and choose between different planning
strategies (such as defining a complete plan upfront versus incrementally adding steps
until reaching a stopping condition, like <code class="language-plaintext highlighter-rouge">smolagents</code> does).</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Exporting Shiny apps with Shinylive</title><link href="https://blancas.io/blog/shinylive/" rel="alternate" type="text/html" title="Exporting Shiny apps with Shinylive" /><published>2024-11-22T00:00:00+00:00</published><updated>2024-11-22T00:00:00+00:00</updated><id>https://blancas.io/blog/shinylive</id><content type="html" xml:base="https://blancas.io/blog/shinylive/"><![CDATA[<p>I was playing around with <a href="https://posit-dev.github.io/r-shinylive/">Shinylive</a> but
encountered some issues when using it; here’s the code that worked. It uses Docker
to make it reproducible.</p>

<p>Copy all the following files in the same directory:</p>

<h2 id="appappr"><code class="language-plaintext highlighter-rouge">app/app.R</code></h2>

<p>A hello-world app for testing. Note that this file lives under the <code class="language-plaintext highlighter-rouge">app/</code> directory; you must
create it inside an <code class="language-plaintext highlighter-rouge">app/</code> directory.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">shiny</span><span class="p">)</span><span class="w">

</span><span class="c1"># Define UI</span><span class="w">
</span><span class="n">ui</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fluidPage</span><span class="p">(</span><span class="w">
    </span><span class="n">titlePanel</span><span class="p">(</span><span class="s2">"Hello World Shiny App"</span><span class="p">),</span><span class="w">
    
    </span><span class="n">sidebarLayout</span><span class="p">(</span><span class="w">
        </span><span class="n">sidebarPanel</span><span class="p">(</span><span class="w">
            </span><span class="n">textInput</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Enter your name:"</span><span class="p">,</span><span class="w"> </span><span class="s2">"World"</span><span class="p">)</span><span class="w">
        </span><span class="p">),</span><span class="w">
        
        </span><span class="n">mainPanel</span><span class="p">(</span><span class="w">
            </span><span class="n">h3</span><span class="p">(</span><span class="s2">"Greeting:"</span><span class="p">),</span><span class="w">
            </span><span class="n">textOutput</span><span class="p">(</span><span class="s2">"greeting"</span><span class="p">)</span><span class="w">
        </span><span class="p">)</span><span class="w">
    </span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1"># Define server logic</span><span class="w">
</span><span class="n">server</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">output</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">output</span><span class="o">$</span><span class="n">greeting</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">renderText</span><span class="p">({</span><span class="w">
        </span><span class="n">paste</span><span class="p">(</span><span class="s2">"Hello,"</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="o">$</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s2">"!"</span><span class="p">)</span><span class="w">
    </span><span class="p">})</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># Run the application</span><span class="w">
</span><span class="n">shinyApp</span><span class="p">(</span><span class="n">ui</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ui</span><span class="p">,</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">server</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="convertr"><code class="language-plaintext highlighter-rouge">convert.R</code></h2>

<p>Simple script to convert the Shiny app to Shinylive.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shinylive</span><span class="o">::</span><span class="n">export</span><span class="p">(</span><span class="s2">"/app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"site"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></h2>

<div class="language-Dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> r-base</span>

<span class="c"># install OS requirements</span>
<span class="k">RUN </span>apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="se">\
</span>    libarchive-dev libssl-dev libcurl4-openssl-dev <span class="se">\
</span>    <span class="o">&amp;&amp;</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span>

<span class="c"># install R requirements</span>
<span class="k">COPY</span><span class="s"> install.R /_shinylive/install.R</span>
<span class="k">RUN </span>Rscript /_shinylive/install.R

<span class="c"># copy the converter script</span>
<span class="k">COPY</span><span class="s"> convert.R /_shinylive/convert.R</span>

<span class="k">WORKDIR</span><span class="s"> /_shinylive</span>

<span class="k">ENTRYPOINT</span><span class="s"> ["Rscript", "/_shinylive/convert.R"]</span>
</code></pre></div></div>

<h2 id="exporting-to-shinylive">Exporting to Shinylive</h2>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># build the docker image</span>
docker build <span class="nt">-t</span> shinylive <span class="nb">.</span>

<span class="c"># export app.R</span>
docker run <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/app shinylive
</code></pre></div></div>

<p>Once the <code class="language-plaintext highlighter-rouge">docker run</code> command finishes, you’ll see a <code class="language-plaintext highlighter-rouge">site/</code> directory.</p>

<p>Then, you can run the exported app with any HTTP server. If you have R, you can use <code class="language-plaintext highlighter-rouge">httpuv</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install httpuv</span>
Rscript <span class="nt">-e</span> <span class="s1">'install.packages("httpuv", repos="https://cran.rstudio.com")'</span>

<span class="c"># run httpuv - and open the printed URL</span>
Rscript <span class="nt">-e</span> <span class="s1">'httpuv::runStaticServer("site/")'</span>
</code></pre></div></div>

<p>If you have a Python installation:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># open: http://localhost:8000</span>
python <span class="nt">-m</span> http.server 8000 <span class="nt">--directory</span> site
</code></pre></div></div>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[I was playing around with Shinylive but encountered some issues when using it; here’s the code that worked. It uses Docker to make it reproducible.]]></summary></entry><entry><title type="html">Minifying HTML for GPT-4o: Remove all the HTML Tags</title><link href="https://blancas.io/blog/html-minify-for-llm/" rel="alternate" type="text/html" title="Minifying HTML for GPT-4o: Remove all the HTML Tags" /><published>2024-09-02T00:00:00+00:00</published><updated>2024-09-02T00:00:00+00:00</updated><id>https://blancas.io/blog/html-minify-for-llm</id><content type="html" xml:base="https://blancas.io/blog/html-minify-for-llm/"><![CDATA[<p>tl;dr: if you want to pass HTML data to GPT-4o, just strip out all the HTML and pass raw text; it’s cheaper and there is little to no performance degradation. <a href="#final-comments">Source code and demo available.</a></p>

<p><img src="/assets/images/html-minify-for-llm/cost_vs_accuracy_comparison.png" alt="" /></p>

<p>Following up my <a href="/blog/ai-web-scraper">earlier post</a> on using GPT-4o for web scraping (and finding out how expensive it is!), I wanted to investigate approaches to lower the cost.</p>

<p>My hypothesis was that the document’s structure would matter when extracting structured data and that I’d see a significant cost vs. accuracy trade-off: by stripping structure out of the HTML document, I expected a noticeable drop in accuracy. This turned out to be false: <em>GPT-4o doesn’t need any HTML structure to correctly extract data.</em></p>

<p>I used the <a href="https://en.wikipedia.org/wiki/Mercury_Prize">Mercury Prize</a> Wikipedia page as input data; the page is a reasonable size and contains a long table with multiple entities (years, artists, albums, nominees), but most importantly, it’s a fun dataset to work with.</p>

<h2 id="experimental-setup-questions">Experimental setup: questions</h2>

<p>Since I wanted to test to what extent HTML structure affects extraction quality, I asked GPT-4o two types of questions:</p>

<ol>
  <li><strong>Unstructured:</strong> the information needed to answer is in the document’s paragraphs and the answer is a string</li>
  <li><strong>Structured:</strong> the information needed to answer is in the table and the answer is structured (a list of strings)</li>
</ol>

<p>I asked 20 questions in total, 10 unstructured and 10 structured.</p>

<p>I varied the complexity of the questions. For the unstructured case, the range was pretty limited since there isn’t much wiggle room, and I didn’t want to ask questions that involved math (evaluating math capabilities is not the purpose of this experiment).</p>

<p>However, the structured case gave me more space to experiment. Here are some sample
questions:</p>

<blockquote>
  <p>Give me the years for the 1st, 4th and 8th editions (in order)</p>
</blockquote>

<p>The answer involves understanding the table structure and order.</p>

<blockquote>
  <p>Extract the shortlisted nominees (include the winner) for the 25th edition, only the artist names (they appear first, followed by the album)</p>
</blockquote>

<p>Answering this also requires understanding the structure: the model first has to find the row for the 25th edition, then extract data from two columns (winner and nominees are in separate columns), and finally split data that appears in the same column (<code class="language-plaintext highlighter-rouge">Artist - Album</code>):</p>

<p><img src="/assets/images/html-minify-for-llm/25th-edition.png" alt="25th-edition" /></p>

<p>You can see all the questions in the <a href="https://github.com/edublancas/posts/tree/main/html-minify-for-llm">source code.</a></p>

<h2 id="experimental-setup-pre-processing">Experimental setup: pre-processing</h2>

<p>Next, I developed a couple of text pre-processing pipelines that transform the HTML document; the objective is to reduce the number of tokens and thus lower the cost (OpenAI charges per token). I tried the following pipelines:</p>

<ol>
  <li>No processing: the HTML document is passed as-is to the model (the most expensive approach!)</li>
  <li>Clean HTML: excludes everything outside the <code class="language-plaintext highlighter-rouge">&lt;body&gt;&lt;/body&gt;</code> tags, removes all attributes from HTML tags (except <code class="language-plaintext highlighter-rouge">class</code>, <code class="language-plaintext highlighter-rouge">id</code>, and <code class="language-plaintext highlighter-rouge">data-testid</code>), replaces <code class="language-plaintext highlighter-rouge">class</code> and <code class="language-plaintext highlighter-rouge">id</code> with increasing numbers (1, 2, 3, etc.), cleans up whitespace, and replaces <code class="language-plaintext highlighter-rouge">&lt;a&gt;TEXT&lt;/a&gt;</code> with <code class="language-plaintext highlighter-rouge">TEXT</code></li>
  <li>HTML remover: completely removes all HTML and only keeps the text</li>
  <li>Converts the HTML into <a href="https://github.com/matthewwithanm/python-markdownify">markdown</a> (I added this because some people recommended it on X/Twitter - LLMs are trained on a lot of markdown, hence, they’re expected to understand its structure)</li>
</ol>
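<p>As an illustration, pipeline 3 (the HTML remover) can be sketched with Python’s standard library; this is a simplified stand-in, not the exact implementation from the source code linked in this post:</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and discards every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def remove_html(html: str) -> str:
    """Pipeline 3: keep only the text, dropping all HTML structure."""
    parser = TextExtractor()
    parser.feed(html)
    # collapse the whitespace left behind by the removed tags
    return " ".join(" ".join(parser.chunks).split())
```

<p>Since only the visible text survives, the token count (and the bill) drops substantially.</p>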

<h2 id="experimental-setup-prompts">Experimental setup: prompts</h2>

<p>Here are the functions that I used to call GPT-4o and GPT-4o mini.</p>

<h3 id="unstructured">Unstructured</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">answer_question</span><span class="p">(</span><span class="o">*</span><span class="p">,</span> <span class="n">html_content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="s">"""
You're an expert question-answering system. You're given a snippet of HTML content
and a question. You need to answer the question based on the HTML content. Your response should be a plain text answer to the question based on the HTML content. Your
answer should be concise and to the point.
    """</span>

    <span class="n">completion</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">.</span><span class="n">strip</span><span class="p">(),</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"HTML Content: </span><span class="si">{</span><span class="n">html_content</span><span class="si">}</span><span class="se">\n\n</span><span class="s">Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">],</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">completion</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>

</code></pre></div></div>

<h3 id="structured">Structured</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ParsedColumn</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">parse_column</span><span class="p">(</span><span class="o">*</span><span class="p">,</span> <span class="n">html_content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="s">"""
You're an expert web scraper. You're given the HTML contents of a table, a user
query and you have to extract a column from it that is related to the user query.

The name of the column should be the header of the column. The values should be the
text content of the cells in the column.
    """</span>

    <span class="n">completion</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"HTML Content: </span><span class="si">{</span><span class="n">html_content</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"User Query: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">],</span>
        <span class="n">response_format</span><span class="o">=</span><span class="n">ParsedColumn</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">event</span> <span class="o">=</span> <span class="n">completion</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">parsed</span>
    <span class="k">return</span> <span class="n">event</span><span class="p">.</span><span class="n">model_dump</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="model-evaluation">Model evaluation</h2>

<p>I considered answers to the <strong>Unstructured</strong> questions to be correct when they contained the expected answer. Let’s see a sample question:</p>

<blockquote>
  <p>Which artist has been nominated the most times for the Mercury Prize without winning?</p>
</blockquote>

<p>Any of these is considered a correct answer:</p>

<ol>
  <li>Radiohead</li>
  <li>Radiohead is the band that has been nominated the most with no wins</li>
  <li>The answer is Radiohead</li>
</ol>
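<p>The check itself is a simple containment test; here’s a sketch (my own simplification, not the exact evaluation code):</p>

```python
def is_correct_unstructured(model_answer: str, expected: str) -> bool:
    """An unstructured answer counts as correct if it mentions the
    expected string; case is ignored so "radiohead" also passes."""
    return expected.lower() in model_answer.lower()


assert is_correct_unstructured("The answer is Radiohead", "Radiohead")
assert not is_correct_unstructured("Oasis", "Radiohead")
```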

<p>In the <strong>Structured</strong> questions there were two cases: in some, the answer’s order did not matter, but in others, it did.</p>

<p>Here’s a sample question whose answer didn’t require ordering:</p>

<blockquote>
  <p>Extract the shortlisted artists (exclude the winner) for 2015. Only artist names (artists appear first, followed by the album)</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">answer</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"Aphex Twin"</span><span class="p">,</span>
        <span class="s">"Gaz Coombes"</span><span class="p">,</span>
        <span class="s">"C Duncan"</span><span class="p">,</span>
        <span class="s">"Eska"</span><span class="p">,</span>
        <span class="s">"Florence and the Machine"</span><span class="p">,</span>
        <span class="s">"Ghostpoet"</span><span class="p">,</span>
        <span class="s">"Róisín Murphy"</span><span class="p">,</span>
        <span class="s">"Slaves"</span><span class="p">,</span>
        <span class="s">"Soak"</span><span class="p">,</span>
        <span class="s">"Wolf Alice"</span><span class="p">,</span>
        <span class="s">"Jamie xx"</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">assert</span> <span class="nb">set</span><span class="p">(</span><span class="n">answer_gpt4</span><span class="p">)</span> <span class="o">==</span> <span class="n">answer</span>
</code></pre></div></div>

<p>Here’s an example of a question whose answer required ordering:</p>

<blockquote>
  <p>Extract the winners of the Mercury Prize from 1992 to 1995, in order</p>
</blockquote>

<p>Expected answer:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">answer</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"Primal Scream – Screamadelica"</span><span class="p">,</span>
    <span class="s">"Suede – Suede"</span><span class="p">,</span>
    <span class="s">"M People – Elegant Slumming"</span><span class="p">,</span>
    <span class="s">"Portishead – Dummy"</span><span class="p">,</span>
<span class="p">]</span>

<span class="k">assert</span> <span class="n">answer_gpt4</span> <span class="o">==</span> <span class="n">answer</span>
</code></pre></div></div>

<h2 id="results">Results</h2>

<p><img src="/assets/images/html-minify-for-llm/cost_vs_accuracy_comparison.png" alt="" /></p>

<p><img src="/assets/images/html-minify-for-llm/accuracy_comparison.png" alt="" /></p>

<h3 id="unstructured-1">Unstructured</h3>

<p>When asking <strong>unstructured</strong> questions, GPT-4o and its mini version have similar performance, and the pre-processing doesn’t make a difference. Since the price gap is big, <strong>I recommend using GPT-4o mini for unstructured questions with all the HTML removed to maximize savings.</strong></p>

<h3 id="structured-1">Structured</h3>

<p><strong>Structured</strong> questions paint a fairly different picture: GPT-4o performs considerably better than the mini version. However, pre-processing has little to no effect on accuracy. Given the price difference between models, <strong>I recommend testing both with a sample of your data and deciding whether the accuracy gains justify the steep price increase. In both cases, you can remove all the HTML tags to reduce the price.</strong></p>

<h3 id="raw-results">Raw results</h3>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>input</th>
      <th>cost</th>
      <th>accuracy</th>
      <th>question_type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>gpt-4o-mini</td>
      <td>raw</td>
      <td>0.163094</td>
      <td>0.8</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>clean</td>
      <td>0.052281</td>
      <td>0.8</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>unstructured</td>
      <td>0.017891</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>markdown</td>
      <td>0.066414</td>
      <td>0.8</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>raw</td>
      <td>0.049740</td>
      <td>0.5</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>clean</td>
      <td>0.014858</td>
      <td>0.3</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>unstructured</td>
      <td>0.004851</td>
      <td>0.4</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>markdown</td>
      <td>0.027072</td>
      <td>0.1</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>raw</td>
      <td>2.718225</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>clean</td>
      <td>0.871350</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>unstructured</td>
      <td>0.298175</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>markdown</td>
      <td>1.106900</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>raw</td>
      <td>0.829000</td>
      <td>0.8</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>clean</td>
      <td>0.247625</td>
      <td>0.7</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>unstructured</td>
      <td>0.080850</td>
      <td>0.7</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>markdown</td>
      <td>0.451200</td>
      <td>0.7</td>
      <td>structured</td>
    </tr>
  </tbody>
</table>

<h2 id="final-comments">Final comments</h2>

<p>Until GPT-4o becomes cheaper, data extraction tasks require careful evaluation
to avoid breaking the bank. You might be just fine with GPT-4o mini in some cases,
but GPT-4o’s performance is much better in others, so evaluate for your use case.</p>

<p>Models have inherent randomness, but I didn’t include accuracy ranges in the results
as that’d involve a higher OpenAI bill (check out my <a href="https://ploomber.io/">startup</a>;
if you become a customer, I’ll be able to justify a higher budget for these
experiments!). But I doubt that repeating the experiments would flip the conclusions.</p>

<p>If you want to run the benchmark, here’s the <a href="https://github.com/edublancas/posts/tree/main/html-minify-for-llm">source code</a>. If
you want to play with the pre-processing pipelines, try this demo app: <a href="https://orange-sea-7185.ploomberapp.io">https://orange-sea-7185.ploomberapp.io</a>. It lets you
enter a URL and estimate the savings.</p>

<p>If you have questions, ping me on <a href="https://x.com/edublancas">X</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[tl;dr; if you want to pass HTML data to GPT-4o, just strip out all the HTML and pass raw text, it’s cheaper and there is little to no performance degradation. Source code and demo available.]]></summary></entry><entry><title type="html">Using GPT-4o for web scraping</title><link href="https://blancas.io/blog/ai-web-scraper/" rel="alternate" type="text/html" title="Using GPT-4o for web scraping" /><published>2024-08-28T00:00:00+00:00</published><updated>2024-08-28T00:00:00+00:00</updated><id>https://blancas.io/blog/ai-web-scraper</id><content type="html" xml:base="https://blancas.io/blog/ai-web-scraper/"><![CDATA[<p>tl;dr; show me the <a href="#conclusions-and-demo">demo and source code!</a></p>

<p><img src="/assets/images/ai-web-scraper/app.png" alt="app" /></p>

<p>I’m pretty excited about the new <a href="https://platform.openai.com/docs/guides/structured-outputs">structured outputs</a>
feature in OpenAI’s API so I took it for a spin and developed an AI-assisted web scraper. This post summarizes my learnings.</p>

<h2 id="asking-gpt-4o-to-scrape-data">Asking GPT-4o to scrape data</h2>

<p>The first experiment was to directly ask GPT-4o to extract the data from an HTML
string, so I used the new structured outputs feature with the following <a href="https://docs.pydantic.dev/latest/">Pydantic</a> models:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span>

<span class="k">class</span> <span class="nc">ParsedColumn</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>


<span class="k">class</span> <span class="nc">ParsedTable</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">columns</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ParsedColumn</span><span class="p">]</span>
</code></pre></div></div>

<p>The system prompt is:</p>

<blockquote>
  <p>You’re an expert web scraper. You’re given the HTML contents of a table and you have to extract structured data from it.</p>
</blockquote>

<p>Here are some interesting things I found when parsing different tables.</p>

<p><em>Note:</em> I also tried GPT-4o mini, but it yielded significantly worse results, so I continued my experiments with GPT-4o.</p>

<h2 id="parsing-complex-tables">Parsing complex tables</h2>

<p><img src="/assets/images/ai-web-scraper/image.png" alt="alt text" /></p>

<p>After experimenting with some simple tables, I wanted to see how the model would do with more complex ones, so I passed a 10-day <a href="https://weather.com">weather</a> forecast from Weather.com. The table
contains a big row for the first day at the top and smaller rows for the other 9
days. Interestingly, GPT-4o was able to parse this correctly:</p>

<p><img src="/assets/images/ai-web-scraper/image-1.png" alt="alt text" /></p>

<p>For the 9 remaining days, the table shows a day and a night forecast (see screenshot above). The model correctly parsed such data and added a <code class="language-plaintext highlighter-rouge">Day/Night</code> column. Here’s how it looks in the browser (note that to display this, we need to click on the button to the right of each row):</p>

<p><img src="/assets/images/ai-web-scraper/image-2.png" alt="alt text" /></p>

<p>At first, I thought the parsed <code class="language-plaintext highlighter-rouge">Condition</code> column was a hallucination since I did not see it on the website. However, upon inspecting the source code, I realized that those tags exist but are invisible in the table.</p>

<h2 id="combined-rows-break-the-model">Combined rows break the model</h2>

<p>When thinking about where to find <em>easy tables</em>, my first thought was <em>Wikipedia</em>. It turns out that a <em>simple</em> table from Wikipedia (<a href="https://en.wikipedia.org/wiki/Human_Development_Index">Human Development Index</a>) breaks the model because rows with repeated values are merged:</p>

<p><img src="/assets/images/ai-web-scraper/image-3.png" alt="alt text" /></p>

<p>And while the model is able to retrieve individual columns (as instructed by the system prompt), they don’t all have the same length; hence, I’m unable to represent the data as a table.</p>

<p>I tried modifying the system prompt with the following:</p>

<blockquote>
  <p>Tables might collapse rows into a single row. If that’s the case, extract the collapsed row as multiple JSON values to ensure all columns contain the same number of rows.</p>
</blockquote>

<p>But it didn’t work. I have yet to try modifying the system prompt
to tell the model to extract rows instead of columns.</p>

<h2 id="asking-gpt-4o-to-return-xpaths">Asking GPT-4o to return XPaths</h2>

<p>Running an OpenAI API call every time can become very expensive, so I figured I’d ask the model to return <a href="https://developer.mozilla.org/en-US/docs/Web/XPath">XPaths</a> instead of
the parsed data. This would allow me to scrape the same page (e.g., to fetch updated data) without breaking the bank.</p>

<p>After some tweaks, I came up with this prompt:</p>

<blockquote>
  <p>You’re an expert web scraper.</p>

  <p>The user will provide the HTML content and the column name.
Your job is to come up with an XPath that will return all elements of that column.</p>

  <p>The XPath should be a string that can be evaluated by Selenium’s
<code class="language-plaintext highlighter-rouge">driver.find_elements(By.XPATH, xpath)</code> method.</p>

  <p>Return the full matching element, not just the text.</p>
</blockquote>

<p>Unfortunately, this didn’t work well. Sometimes, the model would return invalid XPaths (although
this was alleviated with the sentence that mentions Selenium) or XPaths that would
return incorrect data or no data at all.</p>

<h2 id="combining-the-two-approaches">Combining the two approaches</h2>

<p>My next attempt was to combine both approaches: once the model extracted the data,
we could use it as a reference to ask the model for the XPath. <em>This worked much better than straight asking for XPaths!</em></p>

<p>I noticed that sometimes the generated XPath would return no data at all so I added
some dumb retry logic: if the XPath returns no results, try again. This did the trick for
the tables I tested.</p>
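<p>The retry loop can be sketched like this; <code class="language-plaintext highlighter-rouge">generate_xpath</code> and <code class="language-plaintext highlighter-rouge">evaluate_xpath</code> are hypothetical stand-ins for the model call and the Selenium lookup:</p>

```python
def xpath_with_retry(generate_xpath, evaluate_xpath, max_attempts=3):
    """Ask the model for an XPath; if evaluating it returns no
    elements, ask again (up to max_attempts times)."""
    for _ in range(max_attempts):
        xpath = generate_xpath()  # hypothetical: one OpenAI call
        elements = evaluate_xpath(xpath)  # hypothetical: driver.find_elements
        if elements:
            return xpath, elements
    return None, []
```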

<p>However, I noticed a new issue: sometimes the first step (extract data) converted images into text (e.g., an arrow pointing upwards might appear in the
extracted data as “arrow-upwards”). This caused the second step to fail, since it’d look for data that wasn’t there. I did not attempt to fix this problem.</p>

<h2 id="gpt-4o-is-very-expensive">GPT-4o is very expensive</h2>

<p><img src="/assets/images/ai-web-scraper/image-4.png" alt="alt text" /></p>

<p>Scraping with GPT-4o can become very expensive since even small HTML tables can contain lots of characters. I’ve been experimenting for two days and I’ve already spent $24!</p>

<p>To reduce the cost, I added some cleanup logic to remove unnecessary data from the HTML string before passing it to the model. A simple function that removes all attributes except <code class="language-plaintext highlighter-rouge">class</code>, <code class="language-plaintext highlighter-rouge">id</code>, and <code class="language-plaintext highlighter-rouge">data-testid</code> (which are the ones I noticed the generated XPaths were using) trimmed the number of characters in the table by half.</p>
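<p>Such a cleanup function can be sketched with the standard library; this is a simplified version I wrote for illustration, not the code from the repo linked in this post:</p>

```python
from html.parser import HTMLParser

KEEP = {"class", "id", "data-testid"}


class AttributeStripper(HTMLParser):
    """Re-emits the HTML, keeping only the attributes the
    generated XPaths rely on."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in KEEP and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def strip_attributes(html: str) -> str:
    parser = AttributeStripper()
    parser.feed(html)
    return "".join(parser.out)
```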

<p>I didn’t see any performance degradation, and my suspicion is that this cleanup might actually improve extraction quality.</p>

<p>Currently, the second step (generating XPaths) makes one model call per column in
the table; another improvement could be to generate more than one XPath per call. I have yet
to try this approach and evaluate its performance.</p>

<h2 id="conclusions-and-demo">Conclusions and demo</h2>

<p>I was surprised by the extraction quality of GPT-4o (but then sadly surprised when I looked at how much I’d have to pay OpenAI!). Nonetheless, this was a fun experiment and I definitely see potential for AI-assisted web scraping tools.</p>

<p>I did a quick demo using Streamlit, you can check it out here: <a href="https://orange-resonance-9766.ploomberapp.io">https://orange-resonance-9766.ploomberapp.io</a>, the source code is on <a href="https://github.com/edublancas/posts/tree/main/ai-web-scraping">GitHub</a> (Spoiler: don’t expect anything polished).</p>

<p>I wanted to test more tables; however, since that’d involve a higher OpenAI bill, I only tried a handful of them. (Check out my <a href="https://ploomber.io/">startup</a>;
if you become a customer, I’ll be able to justify a higher budget for these
experiments!)</p>

<p>Some stuff I’d like to try if I had more time:</p>

<ol>
  <li>Capture browser events: the current demo is a one-off process where users enter the URL and an initial XPath. This isn’t great UX; it’d be better to ask the user to click on the table they want to extract and to provide some sample rows so the model can understand the structure a bit better.</li>
  <li>In complex tables, a single XPath might not be enough to extract a full column, I’d like to see if asking the LLM to return a program (e.g. Python) would work.</li>
  <li>More experimenting with the HTML cleanup is needed. It’s very expensive to use GPT-4o, and I feel like I’m passing a lot of unnecessary data to the model.</li>
</ol>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[tl;dr; show me the demo and source code!]]></summary></entry><entry><title type="html">Don’t make users read your docs</title><link href="https://blancas.io/blog/users-and-docs/" rel="alternate" type="text/html" title="Don’t make users read your docs" /><published>2022-07-23T00:00:00+00:00</published><updated>2022-07-23T00:00:00+00:00</updated><id>https://blancas.io/blog/users-and-docs</id><content type="html" xml:base="https://blancas.io/blog/users-and-docs/"><![CDATA[<p>As an <a href="https://github.com/ploomber/ploomber">open-source maintainer</a>, I always put effort into documenting all known edge cases so that users know how to fix problems. So, whenever users report incompatibilities, we highlight them in our documentation. Still, I realized this approach wasn’t working when users came to our Slack asking for help with problems we had already documented.</p>

<p>As project maintainers, we tend to be overly optimistic about how good the documentation is. But the target metric should not be how detailed our documentation is but how fast users can get things done. And when things go wrong, reading the documentation is not always the quickest route, so <em>don’t make your users read your docs, help them right on the spot.</em></p>

<h2 id="motivating-example">Motivating example</h2>

<p>A few weeks ago, a user <a href="https://github.com/ploomber/ploomber/issues/882">reported an issue</a>. The details are not important, but it required us to add a new argument to a class. We added the argument to the constructor, documented it, and posted the solution in the GitHub issue; however, when thinking about what would happen if a new user hit the same issue, I realized we had solved the problem for one user but not for the rest. Most likely, other users would have a hard time trying to fix the issue, and they’d give up if they didn’t find the answer quickly.</p>

<h2 id="useful-error-messages">Useful error messages</h2>

<p>A helpful error message tells you three things:</p>

<ol>
  <li>What failed</li>
  <li>Why it failed</li>
  <li>How to fix it</li>
</ol>

<p>For example:</p>

<blockquote>
  <p>RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the ‘spawn’ start method</p>
</blockquote>

<p>This error message contains the three elements:</p>

<ol>
  <li>Cannot re-initialize CUDA [What failed]</li>
  <li>…in forked process [Why it failed]</li>
  <li>Use the ‘spawn’ start method [How to fix it]</li>
</ol>

<p>The problem is that our framework builds an abstraction, so users don’t have to use the <code class="language-plaintext highlighter-rouge">multiprocessing</code> module directly; hence, the user couldn’t fix the issue unless they modified the source code.</p>

<p>In our specific use case, here’s a better error message:</p>

<blockquote>
  <p>RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, pass ‘spawn’ to the ‘start_method’ argument of the Parallel executor constructor</p>
</blockquote>

<p>Let’s see how to achieve this.</p>

<h2 id="helpful-error-messages">Helpful error messages</h2>

<p><em>Note: the following sections contain Python code snippets, but the idea applies to any language.</em></p>

<p>We want to anticipate the error and tell the user how to get things running:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">some_package.exceptions</span> <span class="kn">import</span> <span class="n">SomeException</span>

<span class="k">def</span> <span class="nf">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>
    <span class="p">...</span>


<span class="k">def</span> <span class="nf">thing_that_the_user_calls</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="o">=</span><span class="n">argument</span><span class="p">)</span>
    <span class="k">except</span> <span class="n">SomeException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="c1"># add more context and raise whatever exception type makes sense
</span>        <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">'How to fix it'</span><span class="p">)</span> <span class="k">from</span> <span class="n">e</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="c1"># raise the original exception, unmodified
</span>        <span class="k">raise</span>
    <span class="p">...</span>
</code></pre></div></div>

<p><em>Note:</em> the <code class="language-plaintext highlighter-rouge">raise exception from another_exception</code> expression is called a <a href="https://peps.python.org/pep-3134/">chained exception</a> in Python.</p>

<p>The previous snippet will show the user specific instructions when they encounter the problem while using our software.</p>

<p>However, we’re assuming that:</p>

<ol>
  <li>We can import <code class="language-plaintext highlighter-rouge">some_package.exceptions</code> in our project’s codebase (which implies adding it as a dependency)</li>
  <li>We are sure that when <code class="language-plaintext highlighter-rouge">SomeException</code> is raised, the solution is what we are displaying to the user</li>
</ol>

<p>Sometimes exceptions are too general, so we need to dig deeper. In such cases, we can use the error message as a proxy:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>
    <span class="p">...</span>


<span class="k">def</span> <span class="nf">thing_that_the_user_calls</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="o">=</span><span class="n">argument</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">if</span> <span class="s">'some hint'</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="s">'Instructions on how to fix it'</span><span class="p">)</span> <span class="k">from</span> <span class="n">e</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>There are obvious drawbacks to this approach: the error message might change. However, the same is true for the exception type, so in either case, make sure you have unit tests in place.</p>
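<p>Here is a minimal, self-contained sketch of such a test (it reuses the hypothetical function names from the snippets above and stubs the failing call; in a real project you would write this against the actual dependency, probably with <code class="language-plaintext highlighter-rouge">pytest</code>):</p>

```python
def thing_that_breaks(argument):
    # stand-in for the internal call that raises a vague, low-level error
    raise ValueError('some hint: low-level failure')


def thing_that_the_user_calls(argument):
    try:
        thing_that_breaks(argument=argument)
    except Exception as e:
        if 'some hint' in str(e):
            raise RuntimeError('Instructions on how to fix it') from e
        raise


def test_error_message_is_helpful():
    try:
        thing_that_the_user_calls(argument=None)
    except RuntimeError as e:
        # the user-facing message carries the fix...
        assert 'how to fix it' in str(e)
        # ...and exception chaining preserves the original error
        assert isinstance(e.__cause__, ValueError)
    else:
        raise AssertionError('expected a RuntimeError')


test_error_message_is_helpful()
```

<p>If the underlying error message ever changes, a test like this (written against the real dependency instead of the stub) will fail and alert us to update the check.</p>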

<p>I’ve encountered cases where checking the error message isn’t enough, and we might display inaccurate instructions. In such situations, I write the error message to reflect that uncertainty:</p>

<blockquote>
  <p>If having issues with X, try [possible solution]</p>
</blockquote>

<h2 id="the-end">The end</h2>

<p>If you enjoyed this, let’s connect on <a href="https://twitter.com/edublancas">Twitter</a>, where I often post my adventures as an open-source maintainer, and if you do Data Science, check out our <a href="https://github.com/ploomber/ploomber">project</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[As an open-source maintainer, I always put effort into documenting all known edge cases so that users know how to fix problems. So, whenever users report incompatibilities, we highlight them in our documentation. Still, I realized this approach wasn’t working when users came to our Slack asking for help with problems we had already documented.]]></summary></entry><entry><title type="html">5 signs your Data Science workflow is broken</title><link href="https://blancas.io/blog/ds-broken-workflow/" rel="alternate" type="text/html" title="5 signs your Data Science workflow is broken" /><published>2019-07-16T00:00:00+00:00</published><updated>2019-07-16T00:00:00+00:00</updated><id>https://blancas.io/blog/ds-broken-workflow</id><content type="html" xml:base="https://blancas.io/blog/ds-broken-workflow/"><![CDATA[<p>Developing reproducible data pipelines is hard, but before we even think about reproducibility, your project has to meet some minimum standards. This post discusses some recurring bad practices when developing data pipelines and provides some advice to overcome them.</p>

<h2 id="1-lack-of-setup-instructions">1. Lack of setup instructions</h2>

<p>The first step in every software project is to get the environment up and running (e.g., install UNIX package A, then install Python 3.7, then install Python libraries X, Y, and Z); however, more often than not, the environment is set up once and the instructions are never recorded.</p>

<p>Data Science projects often depend on complex software setups (e.g., installing GPU or database drivers); lack of instructions will surely cause a lot of trouble for the team, especially when a new member joins or when the project is taken to a production environment.</p>

<p>These setup instructions have to be kept up to date at all times: they will break if a single new dependency is not registered, and they become unnecessarily complex if a dependency that is no longer needed stays listed.</p>

<p><strong>How to fix it?</strong> Every project should come with a shell script that sets it up. Package managers do the heavy lifting of installing software, so you can assume that one is already installed.</p>

<p>To prevent setup instructions from becoming outdated, test them every time your code changes by using a <a href="https://en.wikipedia.org/wiki/Continuous_integration">Continuous Integration</a> service such as <a href="https://travis-ci.org/">Travis CI</a>. While CI services can detect when your dependencies no longer work, they cannot detect unnecessary libraries; those you have to remove manually from the setup script.</p>

<h2 id="2-environment-configuration-embedded-in-the-source-code">2. Environment configuration embedded in the source code</h2>

<p>If you keep seeing this error message when running your pipeline:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s2">"/Users/coworkersname/data/clean_v2.parquet"</span> file not found.
</code></pre></div></div>

<p>It is probably because someone on the team hardcoded a path to a file/directory that only exists on their machine. Even if you are working on a shared filesystem, it is a good idea to keep files separate to prevent accidentally overwriting each other’s work. <strong>Explicit paths should never make it into the code.</strong></p>

<p><strong>How to fix it?</strong> Keep all things such as I/O paths and host addresses in a separate place and read from there. For example, you might have a file like this in your project’s root directory:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># locations.yaml</span>
<span class="na">data</span><span class="pi">:</span>
    <span class="c1"># all raw data goes here</span>
    <span class="na">raw</span><span class="pi">:</span> <span class="s">~/project/data/raw/</span>
    <span class="c1"># all processed data goes here</span>
    <span class="na">processed</span><span class="pi">:</span> <span class="s">~/project/data/processed</span>

<span class="c1"># host to the database</span>
<span class="na">db</span><span class="pi">:</span> <span class="s">db.organization.com:5421/database</span>
</code></pre></div></div>

<p>Everyone should then treat that file as a <em>contract</em>: read from and write to only those directories. Each member can customize their configuration file and nothing should break. Your code will look like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">my_project</span> <span class="kn">import</span> <span class="n">locations</span>


<span class="k">def</span> <span class="nf">clean_data</span><span class="p">():</span>
    <span class="c1"># load content of locations.yaml
</span>    <span class="n">path_raw</span> <span class="o">=</span> <span class="n">locations</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="s">'raw'</span><span class="p">]</span>
    <span class="n">path_clean</span> <span class="o">=</span> <span class="n">locations</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="s">'processed'</span><span class="p">]</span>

    <span class="c1"># read a file relative to the raw data folder...
</span>    <span class="n">pd</span><span class="p">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">path_raw</span><span class="p">,</span> <span class="s">'dataset.parquet'</span><span class="p">))</span>

    <span class="c1"># clean the data...
</span>
    <span class="c1"># write to a file relative to the clean data folder...
</span>    <span class="n">pd</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">path_clean</span><span class="p">,</span> <span class="s">'dataset.parquet'</span><span class="p">))</span>
</code></pre></div></div>

<p>Make sure the file is easy to discover from your scripts: you might want to create a function that automatically looks for a <code class="language-plaintext highlighter-rouge">locations.yaml</code> file in the current working directory or any of its parent folders (up to a certain number of levels) and raises an <code class="language-plaintext highlighter-rouge">Exception</code> if it cannot find one.</p>

<h2 id="3-end-to-end-pipeline-execution-requires-manual-intervention">3. End-to-end pipeline execution requires manual intervention</h2>

<p>A pipeline is not truly a pipeline if it needs manual intervention to run. Given the raw data, you should be able to run it end-to-end with a single command. For starters, that means using only scripting tools such as Python or R, and no GUI tools such as Excel.</p>

<p>Automated execution is a prerequisite for automated testing. Bugs are inevitable, but automated testing can save you from finding those bugs in a production environment.</p>

<p><strong>How to fix it?</strong> If setup instructions are provided and there are no hardcoded paths, automating the pipeline will be easier. As with setup instructions, the only reliable way to keep it working is to include a shell script in the CI service that runs the pipeline. If you are working with large datasets, you may want to pass a sample of the data for testing purposes.</p>

<h2 id="4-intermediate-results-are-shared-over-e-mailcloud-storage">4. Intermediate results are shared over e-mail/cloud storage</h2>

<p>An unfortunately common practice in many data analysis projects is to share intermediate results. Reasons vary, but the pattern goes like this: member <code class="language-plaintext highlighter-rouge">A</code> updates some code in the pipeline that <code class="language-plaintext highlighter-rouge">B</code> needs as input, so <code class="language-plaintext highlighter-rouge">A</code> runs the updated code and shares the new results with <code class="language-plaintext highlighter-rouge">B</code>, who then uses the new file as input instead of the old version.</p>

<p>Sharing intermediate results is a terrible practice since it makes reproducibility harder. <strong>Intermediate results should never be shared</strong>: <code class="language-plaintext highlighter-rouge">A</code> should just push the new code and <code class="language-plaintext highlighter-rouge">B</code> should execute it to generate the new input.</p>

<p><strong>How to fix it?</strong> Fixing this pattern is harder; all the previous sections are prerequisites for this one, namely:</p>

<ol>
  <li>There should be a setup script to configure the environment</li>
  <li>Configuration should be centralized in a single file, out of the source code</li>
  <li>There should be a script to execute the pipeline end-to-end</li>
</ol>

<p>If all those requirements are met, there is no need to share intermediate files.</p>

<p>The only situation where sharing intermediate files might be necessary is when a task either a) takes <em>a long time</em> to run or b) has to run in a restricted environment (e.g., a shared cluster). In such cases, take special care to ensure that the code that produced the results is appropriately stored in version control. <strong>Avoid this situation as much as possible.</strong></p>

<p>For most projects, this should not be the case. If you are working with large datasets, you probably already have distributed infrastructure that makes your computationally heavy scripts run in a reasonable amount of time; if they do not, consider splitting them into smaller steps.</p>

<h2 id="5-a-change-in-a-single-step-requires-you-to-execute-the-pipeline-end-to-end">5. A change in a single step requires you to execute the pipeline end-to-end</h2>

<p>During development, steps are constantly revisited (features added, bugs fixed). Every time you make a change, you have to make sure it propagates to the steps downstream. Since steps in a data pipeline often take minutes or even hours to run, an update should only trigger execution of its downstream dependencies to avoid wasteful computation.</p>

<p>If there is no way for your pipeline to know which steps are affected by a given update, you only have two choices: run the entire pipeline again or manually check which steps have to be re-run. Both options are a waste of your time.</p>

<p><strong>How to fix it?</strong> There is no single answer here. I have not found a library that easily fixes this issue (I implemented my own solution, but it is not publicly available yet). If all your processing is done locally, my recommendation is to use <a href="https://en.wikipedia.org/wiki/Make_(software)">Make</a>.</p>

<h2 id="final-comments">Final comments</h2>

<p>I hope this post helps you find areas for improvement in your data projects. Paying attention to these issues will pay off in the long run. A working workflow will not only increase your productivity, helping you get your analysis right faster, but will also help you build more robust data products.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[Developing reproducible data pipelines is hard, but before we even think about reproducibility your project has to meet some minimum standards. This post discusses some recurring bad practices when developing data pipelines and provides some advice to overcome them.]]></summary></entry><entry><title type="html">The case against data versioning</title><link href="https://blancas.io/blog/ds-versioning-data/" rel="alternate" type="text/html" title="The case against data versioning" /><published>2019-06-27T00:00:00+00:00</published><updated>2019-06-27T00:00:00+00:00</updated><id>https://blancas.io/blog/ds-versioning-data</id><content type="html" xml:base="https://blancas.io/blog/ds-versioning-data/"><![CDATA[<p>A recent technique to advocate for reproducibility in data analysis is <em>data versioning</em>, which means that some (or all) intermediate files generated by the pipeline are saved and tagged so we can come back to them at any moment. But I think data versioning is actually <em>harmful</em> for reproducibility.</p>

<p>Reproducibility is defined as the <em>“ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. Reproducibility is a minimum necessary condition for a finding to be believable and informative <a href="https://stm.sciencemag.org/content/8/341/341ps12.full">(Source)</a>.”</em>  The key term here is <em>materials</em>. The only materials in a data pipeline are the raw data and the code, all other artifacts are byproducts which should not be considered.</p>

<p>We can test for reproducibility by answering the following question: given the <em>same raw data and code</em>, do we get the same results? Using intermediate results and claiming reproducibility is cheating, since we are overlooking all the previous computations that produced those interim results.</p>

<p><em>Reproducibility can only be achieved by construction</em>, it is not a feature you add to your pipeline. The bad news is that you cannot do <code class="language-plaintext highlighter-rouge">pip install reproducibility</code>; the only way to achieve it is through better software engineering practices. The good news is that verifying reproducibility is trivial.</p>

<p><strong>Verifying reproducibility</strong></p>

<p>Verifying that a data pipeline is reproducible is as simple as passing in the <em>raw data</em> and comparing the result against the <em>claimed final output</em>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pipeline</span>

<span class="n">result_final</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="s">'/path/to/raw/data.csv'</span><span class="p">)</span>
<span class="n">result_expected</span> <span class="o">=</span> <span class="mi">42</span>

<span class="k">assert</span> <span class="n">result_final</span> <span class="o">==</span> <span class="n">result_expected</span>
</code></pre></div></div>

<p><strong>Saving intermediate results is useful (for other purposes)</strong></p>

<p>Data pipelines are built from steps that run one after the other; when the final output is unexpected, storing intermediate results makes the pipeline more transparent: we can inspect those results and identify which step went wrong.</p>

<p>They are also useful for avoiding redundant computation. Pipelines usually take a long time to run, and doing an end-to-end run after every small change is wasteful. Selectively running the steps affected by the changes should lead to the same result as executing the pipeline end-to-end.</p>

<p><strong>Versioning your final output</strong></p>

<p>There is one piece of your pipeline that you can version: the final output. If you want to automate reproducibility verification, you might want to store the final output and compare it with the pipeline’s output whenever a change is introduced. If results do not match, your pipeline is no longer reproducible.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[A recent technique to advocate for reproducibility in data analysis is data versioning, which means that some (or all) intermediate files generated by the pipeline are saved and tagged so we can come back to them at any moment. But I think data versioning is actually harmful for reproducibility.]]></summary></entry><entry><title type="html">Applying to a master’s in the U.S. (II): Selecting programs</title><link href="https://blancas.io/blog/study-us-ii/" rel="alternate" type="text/html" title="Applying to a master’s in the U.S. (II): Selecting programs" /><published>2019-04-07T00:00:00+00:00</published><updated>2019-04-07T00:00:00+00:00</updated><id>https://blancas.io/blog/study-us-ii</id><content type="html" xml:base="https://blancas.io/blog/study-us-ii/"><![CDATA[<p>In this second part, I’ll cover which factors to take into account when choosing the programs you’ll apply to. Consider around 10-15 programs in your search, and finally apply to about 6. Program quality is the most important factor, but keep in mind that the more prestigious a program is, the more competitive admission will be; the rest of the aspects I’ll mention are in no particular order, and it’s up to you to decide which matters most.</p>

<h2 id="aspectos-a-considerar">Aspects to consider</h2>

<p><em>tl;dr: research the program’s quality across multiple sources, make sure the program holds recruiting events, and keep in mind that costs vary widely and that you’ll need to prove financial solvency to get the visa</em></p>

<h3 id="calidad-del-programa">Program quality</h3>

<p>The most important factor is the quality of the program, but it is also the hardest to evaluate. Relying on rankings is somewhat misleading because some schools are known only in certain areas, so they don’t show up in the general rankings (example: Carnegie Mellon in Computer Science). It is better to go by the per-area rankings, but even then, there may be more than one program in the same department. My recommendation is to use every resource within your reach, and pay special attention to the details of each program. The best way to get more detailed information about specific programs is to talk to a graduate (LinkedIn is a good resource for this).</p>

<p>Pay attention to things such as: required courses, the selection of electives, the possibility of taking courses in other departments or nearby schools, program size (a 30-person program is a very different experience from a 200-person one), and graduate statistics.</p>

<h3 id="salida-laboral">Job prospects</h3>

<p>If you are interested in doing a summer internship or working temporarily after your program (both the F1 and J1 student visas allow it), it is important to research whether the programs hold recruiting fairs; in my experience, this is one of the best ways to get an interview (the other is a referral from someone who works there). Make sure to look for recruiting events exclusive to the program you’re interested in; school-wide events are not as effective, since so many people attend that you don’t have enough time to strike up a conversation with the recruiters.</p>

<h3 id="costo-del-programa">Program cost</h3>

<p>This point is the most tedious one, especially if you are just starting to look at options, but informing yourself ahead of time will spare you many complications later. Annual tuition varies between universities, and living costs vary even <em>much more</em> from city to city (living in NYC is far more expensive than living in Austin). When applying for the visa, you will have to prove financial solvency through a bank account, scholarship award documents, loans, etc. It is very likely that at this stage of the process you still won’t know whether you’ve been granted a scholarship, so if you depend on one to cover your expenses, choosing a more affordable program will reduce the risk of complications if the scholarship doesn’t come through (in the next article I will focus on scholarships, loans, and other options for financing your program).</p>

<p>Few master’s programs offer scholarships, and they usually award them along with your acceptance letter, so if a program you’re interested in offers one, take that into account. Program length matters a lot because it directly impacts cost: most programs take two years, but there are also 3-semester and even one-year programs. Although a longer program will give you more time to deepen your learning, the increase in cost will be considerable.</p>

<h3 id="ubicación-de-la-universidad">University location</h3>

<p>Beyond its economic impact, the university’s location will also shape your experience. First, consider that some universities are located in big cities (New York, Chicago), others in smaller cities (Boston), and others in places where <a href="https://en.wikipedia.org/wiki/College_town">the university is pretty much all there is</a>. Certainly, the quality of your program matters far more than the city’s entertainment options, but at least do the mental exercise of imagining yourself living in this or that city and make sure you would feel comfortable there.</p>

<p>On the other hand, location will also affect your job prospects. Even though most applications happen online and nothing stops you from moving to another city for work, being close to industry will make the process easier. For starters, companies tend to recruit at local universities; moreover, if you start job hunting before graduating, it will be much easier to attend interviews if the companies are in the same city (the final round of interviews is always at the company’s office); otherwise, your availability for interviews will be limited by how long you can be away without hurting your academic performance.</p>

<h2 id="a-cuántos-programas-aplicar">How many programs should you apply to?</h2>

<p><em>tl;dr: apply to 6 programs; pick the two you want most, two not-so-competitive ones, and two “safe” options</em></p>

<p>Once you have a list of about 10-15 programs, it is time to decide which ones to apply to. Keep in mind that each application will cost 75-100 USD, but even more important, you will have to send a <em>different</em> statement of purpose and recommendation letters to each program. How different? That is up to you. Although you can send the same letters, I think that is a very bad strategy since it shows little interest on your part. At a minimum, your letter (and your recommenders’ letters) should include the name of the university and the program; ideally, a portion of your statement of purpose will discuss the details of each program and explain why they should admit you (asking for a different recommendation letter per program is very complicated, so at a minimum ask your recommenders to change the name).</p>

<p>On the other hand, the fewer applications you send, the greater the risk of not getting into any program (yes, that happens, especially in competitive programs where admission rates are often in the single digits). I recommend applying to no fewer than 6 programs; all of them should be programs you are sure you would enroll in if admitted. It is important to select those programs in a way that reduces the risk of being left out entirely; I recommend doing it as follows: 2 of them can be free picks (the two best programs in your area, for example), another 2 can be programs that are not as competitive, and two “safe” options. The hard part is assessing how likely you are to be admitted; for that, it is best to talk to an <a href="https://educationusa.state.gov/centers/educationusa-advising-center-comexus">expert</a> who can evaluate your profile.</p>

<h2 id="comentarios-finales">Final comments</h2>

<p>Choosing the programs you’ll apply to is not easy, so give it enough time. In the next part, I will talk about available scholarships and loans, as well as other options for financing your program. If you have any questions or comments, don’t hesitate to write me on Twitter <a href="http://twitter.com/edublancas/">@edublancas</a> or by email at <a href="mailto:edu@blancas.io">edu@blancas.io</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[In this second part, I’ll cover which factors to take into account when choosing the programs you’ll apply to. Consider around 10-15 programs in your search, and finally apply to about 6. Program quality is the most important factor, but keep in mind that the more prestigious a program is, the more competitive admission will be; the rest of the aspects I’ll mention are in no particular order, and it’s up to you to decide which matters most.]]></summary></entry><entry><title type="html">Applying to a master’s in the U.S. (I): Planning your application</title><link href="https://blancas.io/blog/study-us-i/" rel="alternate" type="text/html" title="Applying to a master’s in the U.S. (I): Planning your application" /><published>2019-03-30T00:00:00+00:00</published><updated>2019-03-30T00:00:00+00:00</updated><id>https://blancas.io/blog/study-us-i</id><content type="html" xml:base="https://blancas.io/blog/study-us-i/"><![CDATA[<p>To inaugurate my blog, I have decided to write a series of articles for those interested in getting into a competitive STEM master’s program in the U.S. This series will contain information I gathered from various sources while going through the process, but also things I had to learn along the way (which would have been very useful to know from the start).</p>

<p>Keep in mind that these articles are based solely on my experience, and it is impossible to give step-by-step guides since every school has different criteria. I got most of the information in this first article from <a href="https://quora.com">Quora</a> and <a href="https://magoosh.com">Magoosh</a>; I recommend looking for more detailed resources on those sites.</p>

<p>In this first part, I will talk about what to consider if you are thinking of applying: whether you are in the first years of your bachelor’s degree, about to graduate, or a few months away from starting the application process.</p>

<p>The first thing to mention is that every aspect of your application matters, and the only way to improve your chances is to have a competitive profile.</p>

<h2 id="si-estás-leyendo-esto-durante-los-primeros-años-de-tu-licenciatura">If you are reading this during the first years of your bachelor’s degree</h2>

<p>TL;DR: Keep your GPA above 9.2 (out of 10) and get involved in academic activities (preferably with an institution in the U.S.).</p>

<h3 id="tu-promedio-importa">Your GPA matters</h3>

<p>If you are pursuing an undergraduate degree and considering a <em>highly competitive</em> master's program in the U.S., your undergraduate GPA is something you should take care of. Even though schools state that they have “no minimum GPA requirement,” a low GPA can rule you out (although a high one does not guarantee admission). In general, consider that a “good” GPA for one of these programs is 3.7/4 (equivalent to 9.2/10). If your GPA is lower than that, it does not mean you have no chance of being admitted, but you will need to make up for it in other aspects of your application (with an excellent GRE score, for example).</p>

<p>Two important considerations: if you come from a school that grades strictly and a member of the committee knows it, that will be a factor they consider. It is hard to know how familiar the admissions committee is with your school, but you can research whether alumni of your school have graduated from the programs you are interested in, or better yet, whether a professor in the program graduated from your university.</p>

<p>Another important detail is that the admissions committee will give more weight to grades in your field than to the rest (a 7 will hurt you more if it was in calculus than if it was in literature).</p>

<h3 id="tus-actividades-fuera-del-salón-también">…so do your activities outside the classroom</h3>

<p>One way to stand out among applicants is to show that you are involved in your field outside the classroom. If you have the chance to get involved in projects at your university or at a company, it can help you a lot (the summer or an exchange semester are good ways to do so). Something that can definitely make a difference is doing these projects at a university in the U.S., better still at a school with prestige in your area of interest.</p>

<p>Some master's programs are research-oriented (you have to write a thesis); this is most common in the sciences (in engineering, programs tend to be more applied). If your program requires a thesis, it is important to focus your academic activities on research (rather than doing applied projects at a company, for example), even better if those projects lead to scientific publications.</p>

<h3 id="estudia-inglés">…study English</h3>

<p>This is an obvious point, but I don't want to leave it out. If you are in the first years of your degree and cannot <em>speak</em> English fluently, it is important to start practicing early, since the English exam you will be asked for (TOEFL iBT) includes a <em>speaking</em> section.</p>

<h3 id="y-si-ya-me-gradué-o-estoy-a-punto-de-graduarme">What if I already graduated or am about to?</h3>

<p>If you have already graduated or are about to, it will be harder to raise your GPA or get involved in academic activities, so getting good scores on the exams (next section) is very important. If you graduated a few years ago, your work experience (especially if it is in the field of the master's program) can also help you.</p>

<h2 id="los-requisitos-para-la-solicitud-de-admisión">The admission application requirements</h2>

<p>TL;DR Get a TOEFL score of at least 100 points and a GRE quantitative score at least in the 90th percentile.</p>

<p>Practically every program you apply to will ask for the same requirements: GRE, TOEFL iBT, <em>résumé</em>, statement of purpose, and letters of recommendation. It is important to plan how to meet these requirements well in advance. Applications are due in December; I recommend making a first attempt at the exams and starting to work on the statement of purpose and the recommendation letters about 6 months earlier. It may seem like a lot of time, but many factors will be out of your control (available exam dates, for example), and I assure you it will take longer than you plan.</p>

<p>Another important aspect is program selection; I also recommend starting your research about 6 months in advance (the next post in this series will focus on that). Programs usually do not have minimum exam scores, but some do (for example, I remember seeing a program with a minimum score in the <em>speaking</em> section of the TOEFL iBT), so it is important to have an idea of the programs you will apply to in case any of them have requirements of this kind.</p>

<p>There is an enormous amount of material on how to prepare your application, so I will be brief and only include the points I consider most important.</p>

<h3 id="consejos-para-los-exámenes">Tips for the exams</h3>

<h4 id="toefl-ibt">TOEFL iBT</h4>

<p>Giving advice about the TOEFL is difficult because it depends heavily on how prepared you are a few months before applying. Ideally you already have a good level and only want to spend a little time improving your score. The best way to get a good diagnosis is to take the exam once. If you achieve a very good score (&gt;=110), forget about the TOEFL and focus on the GRE; if your score is not as good (under 100), consider taking a course to raise it.</p>

<h4 id="gre">GRE</h4>

<p>In technical programs, your GRE score is a basic requirement (like your undergraduate GPA). The <em>verbal reasoning</em> and <em>analytical writing</em> sections are just a requirement: as long as your score is decent (what counts as decent depends on each program), the admissions committee probably won't give them much weight. The score that matters is <em>quantitative reasoning</em>. As a rule of thumb, consider a good score to be at or above the 90th percentile. The best resource I found for studying for this exam and getting statistics on which scores are considered good is <a href="https://magoosh.com/">Magoosh</a>. Likewise, I recommend taking the exam about 6 months in advance to assess your situation and determine whether you need to study and retake it.</p>

<h4 id="cómo-prepararme">How should I prepare?</h4>

<p>The only advice I can give about preparing is to practice both exams at home in a format as close as possible to the real thing: with a timer per section and using only the allowed materials. This is especially important for the GRE, where time per section is critical and you need to get used to solving the questions quickly.</p>

<h3 id="résumé"><em>résumé</em></h3>

<p>Don't overcomplicate the design of your résumé; look for a template so it follows a standard structure. Keep it strictly to one page. Use bullet points, and look for guides on how to write them. It is important that the points be concise and specific.</p>

<h3 id="carta-de-motivos">Statement of purpose</h3>

<p>The statement of purpose is your only chance to convince the committee to admit you; it should be brief and concise (one page is an appropriate length). Use this space to talk about what you have done, what you will do in the program, and what you will do after graduating. Emphasize what makes your profile distinctive; admissions committees highly value diversity in all respects. If there is a weak point in your application (for example, a low undergraduate GPA), use this space to explain any extraordinary circumstance that may have affected it, if that is the case.</p>

<h3 id="cartas-de-recomendación">Letters of recommendation</h3>

<p>Programs usually ask for 3 letters of recommendation. They must be strictly academic/professional. Request them only from people who can speak concretely about your abilities: it is better to have a recommendation letter from a first-year professor you worked with on a project for a whole year than from the dean of your school who only knows your name. The more detailed the letter, the better, so consider people you know will take enough time to write a very positive and detailed letter. If you are in the first years of your degree, this is the right time to start approaching your professors and building academic relationships.</p>

<h2 id="comentarios-finales">Final remarks</h2>

<p>In the next part I will talk about how to choose the programs you will apply to. Until next time! If you have any questions or comments, don't hesitate to write to me on Twitter <a href="http://twitter.com/edublancas/">@edublancas</a> or by email at <a href="mailto:edu@blancas.io">edu@blancas.io</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[To inaugurate my blog, I have decided to write a series of articles for those interested in getting into a competitive master's program in STEM in the U.S. This series will contain information I gathered from various sources while I was going through the process, as well as things I had to learn along the way (and that would have been very useful to know from the start).]]></summary></entry></feed>