<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blancas.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blancas.io/" rel="alternate" type="text/html" /><updated>2025-02-10T23:43:22+00:00</updated><id>https://blancas.io/feed.xml</id><title type="html">Eduardo Blancas</title><subtitle>Personal website</subtitle><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><entry><title type="html">How do AI agents work, anyway?</title><link href="https://blancas.io/blog/smolagents/" rel="alternate" type="text/html" title="How do AI agents work, anyway?" /><published>2025-02-10T00:00:00+00:00</published><updated>2025-02-10T00:00:00+00:00</updated><id>https://blancas.io/blog/smolagents</id><content type="html" xml:base="https://blancas.io/blog/smolagents/"><![CDATA[<!-- 
At [work](https://ploomber.io/), we've initiated several projects incorporating
AI agents into our operations and product, which sparked my interest in better understanding
the technology. While exploring resources, I noticed a significant gap
between the concepts and the implementations, so I chose to examine one
framework (`smolagents`) in detail to understand how it works. -->

<p>In this post, I’ll provide a brief conceptual introduction to AI agents and analyze
the implementation of the <code class="language-plaintext highlighter-rouge">smolagents</code> library by examining its OpenAI API calls using
<code class="language-plaintext highlighter-rouge">mitmproxy</code> and DuckDB.</p>

<h1 id="what-are-agents">What are agents?</h1>

<p>An AI agent is a program that performs actions through a set of tools. For example,
ChatGPT is an agent that can search the web via a tool. Agents use Large Language Models
(LLMs) to break down tasks into smaller ones (planning), choose which tools to use at
each step, and determine when the task is complete.</p>

<p>Tools are typically functions (like Python functions) that the agent calls to
retrieve results or perform actions (such as writing to a database).</p>
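<p>Concretely, a tool is just a function whose name, docstring, and type hints get exposed to the LLM so it knows when and how to call it. A hypothetical example (the weather data here is a stub, not a real API):</p>

```python
def get_weather(city: str) -> str:
    """Return the current weather for a city.

    The docstring and type hints matter: agent frameworks typically
    include them in the prompt so the LLM knows when and how to call
    the tool.
    """
    # A real tool would query an external API; this is a stub.
    fake_data = {"Paris": "18C, cloudy"}
    return fake_data.get(city, "unknown")

print(get_weather("Paris"))  # 18C, cloudy
```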

<p>The plan is the series of steps that the agent will perform. Not all plans are created
equal: shorter plans and less computationally expensive plans are desirable.</p>
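<p>The plan/act/observe cycle can be sketched in a few lines. This is an illustration, not <code class="language-plaintext highlighter-rouge">smolagents</code>’ actual code; the <code class="language-plaintext highlighter-rouge">llm_stub</code> function stands in for a real LLM call:</p>

```python
def multiply(a, b):
    return a * b

def llm_stub(task, observations):
    # A real agent would call an LLM here; we hardcode a two-step
    # plan for illustration: first multiply, then finish.
    if not observations:
        return {"tool": "multiply", "args": (2, 21)}
    return {"tool": "final_answer", "args": (observations[-1],)}

def run_agent(task):
    observations = []
    while True:
        step = llm_stub(task, observations)
        if step["tool"] == "final_answer":
            # the agent decides the task is complete
            return step["args"][0]
        # otherwise, execute the chosen tool and record the observation
        observations.append(multiply(*step["args"]))

print(run_agent("How much is 2 * 21?"))  # 42
```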

<p>To learn more about AI agents, check out
<a href="https://huyenchip.com/2025/01/07/agents.html#planning">Chip Huyen’s blog post</a>
and the <a href="https://huggingface.co/blog/smolagents">smolagents</a> blog post.</p>

<h1 id="code-agents">Code agents</h1>

<p>Since the LLM decides which tools to run at each step, we need a way to represent
tool calling (aka function calling). Code agents represent their tool calls using
actual code (e.g., Python code), in contrast to other agents which represent tool
calls with JSON. <a href="https://arxiv.org/abs/2402.01030">Research has shown</a> that code-based tool
calling produces more effective agents.</p>
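<p>To make the contrast concrete, here is a minimal sketch (illustrative names only) of the same tool call represented both ways, and how each would be executed:</p>

```python
import json

def multiply(a, b):
    return a * b

# JSON-style tool call: the LLM emits a structured object...
json_call = json.loads('{"name": "multiply", "arguments": {"a": 2, "b": 21}}')

# Code-style tool call: the LLM emits Python source instead.
code_call = "result = multiply(a=2, b=21)"

# Dispatching the JSON call requires a name-to-function lookup:
tools = {"multiply": multiply}
json_result = tools[json_call["name"]](**json_call["arguments"])

# Executing the code call means running it in an interpreter
# (code agents do this in a restricted/sandboxed one):
namespace = {"multiply": multiply}
exec(code_call, namespace)

print(json_result, namespace["result"])  # 42 42
```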

<p>We’ll be using the <a href="https://github.com/huggingface/smolagents">smolagents</a> framework
to understand how agents work with the code agent configuration (though you can
also use the JSON configuration).</p>

<h1 id="setup">Setup</h1>

<p>First, let’s install the required dependencies:</p>

<ul>
  <li><a href="https://github.com/huggingface/smolagents">smolagents</a> for running the agent</li>
  <li><a href="https://mitmproxy.org/">mitmproxy</a> for intercepting OpenAI API requests</li>
  <li><a href="https://duckdb.org/">DuckDB</a> for querying the OpenAI API logs</li>
  <li><a href="https://github.com/Textualize/rich">rich</a> for prettier terminal output</li>
</ul>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install packages (including litellm for OpenAI model support)</span>
pip <span class="nb">install</span> <span class="s1">'smolagents[litellm]'</span> mitmproxy duckdb rich
</code></pre></div></div>

<p>Download the code:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone <span class="nt">--depth</span> 1 https://github.com/edublancas/posts
<span class="nb">cd </span>posts/smolagents/
</code></pre></div></div>

<p>Next, start the reverse proxy to intercept OpenAI requests and log them to a <code class="language-plaintext highlighter-rouge">.jsonl</code> file that we’ll query later with DuckDB:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mitmdump <span class="nt">-s</span> proxy_logger.py <span class="nt">--mode</span> reverse:https://api.openai.com <span class="nt">--listen-port</span> 8080
</code></pre></div></div>
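<p>The <code class="language-plaintext highlighter-rouge">proxy_logger.py</code> script ships with the repo; conceptually, its logging core boils down to appending one JSON object per intercepted request/response pair (the field names here are illustrative, not necessarily what the script records):</p>

```python
import json
import time

def log_exchange(path, request_body, response_body):
    """Append one request/response pair as a line of JSON (.jsonl format)."""
    record = {
        "timestamp": time.time(),
        "request": request_body,
        "response": response_body,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: log one fake exchange
log_exchange("demo_logs.jsonl", {"model": "gpt-4o-mini"}, {"id": "chatcmpl-123"})
```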

<h1 id="basic-example-multiplication-no-tools">Basic example (multiplication, no tools)</h1>

<p>Let’s start with a basic example: asking the model to perform a simple multiplication without providing any tools.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">smolagents</span> <span class="kn">import</span> <span class="n">CodeAgent</span><span class="p">,</span> <span class="n">OpenAIServerModel</span>

<span class="c1"># Initialize the model with our reverse proxy
</span><span class="n">model</span> <span class="o">=</span> <span class="n">OpenAIServerModel</span><span class="p">(</span>
    <span class="n">model_id</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
    <span class="n">api_base</span><span class="o">=</span><span class="s">"http://localhost:8080/v1"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Create an agent with no tools
</span><span class="n">agent</span> <span class="o">=</span> <span class="n">CodeAgent</span><span class="p">(</span><span class="n">tools</span><span class="o">=</span><span class="p">[],</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">add_base_tools</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># Run the agent with a simple multiplication task
</span><span class="n">agent</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="s">"How much is 2 * 21?"</span><span class="p">)</span>
</code></pre></div></div>

<p>After running the code, we can view the execution logs by running <code class="language-plaintext highlighter-rouge">python print.py</code>, which displays all logs from the <code class="language-plaintext highlighter-rouge">.jsonl</code> file.</p>

<h2 id="prompt">Prompt</h2>

<p>Here’s the prompt for the API call. I removed several parts to make it shorter (several
parts are redundant, there to deal with the stubbornness of LLMs), but kept the overall message the
same. You can see the complete logs in the <code class="language-plaintext highlighter-rouge">openai_logs_no_base_tools.jsonl</code> file.</p>

<h3 id="system">System</h3>

<p>The system prompt tells the model what its purpose is and the rules it must abide
by. It essentially tells the LLM that its job is to solve tasks with tools, and
that solving a task involves cycling through three steps: <code class="language-plaintext highlighter-rouge">Thought</code>, <code class="language-plaintext highlighter-rouge">Code</code>, and <code class="language-plaintext highlighter-rouge">Observation</code>:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are an expert assistant who can solve any task using code blobs. You have been
given access to a list of tools: these tools are Python functions.

To solve the task, you must plan forward to proceed in a series of steps, in a cycle
of 'Thought:', 'Code:', and 'Observation:' sequences.

'Thought:' sequence, you should explain your reasoning and the tools that you want to use.

'Code:' sequence, you should write the code in Python.

During each intermediate step, use <span class="sb">`print()`</span> to save important information.
These <span class="sb">`print`</span> outputs will then appear in the 'Observation:' field, which will be
available as input for the next step.

In the end you have to return a final answer using the <span class="sb">`final_answer`</span> tool.

Here are a few examples:
<span class="p">
---
</span>
Task: "What is the result of the following operation: 5 + 3 + 2"

Thought: I will use python code to compute the result of the operation and then return
the final answer using the <span class="sb">`final_answer`</span> tool

Code:
<span class="p">```</span><span class="nl">py
</span><span class="n">result</span> <span class="o">=</span> <span class="mi">5</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">2</span>
<span class="n">final_answer</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="p">```</span>
<span class="gh">&lt;end_code&gt;
---
</span>
[...MORE EXAMPLES HERE]
</code></pre></div></div>

<p>After listing a few more examples, the system prompt includes the available tools (we
only have the <code class="language-plaintext highlighter-rouge">final_answer</code> tool) and the rules it must abide by:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You only have access to these tools:<span class="sb">


</span><span class="p">-</span> final_answer: Provides a final answer to the given problem.
    Takes inputs: {'answer': {'type': 'any', 'description': 'The final answer to the problem'}}
    Returns an output of type: any
[... MORE TOOLS ARE ADDED HERE, IF ANY]<span class="sb">


</span>Here are the rules you should always follow to solve your task:
<span class="p">
1.</span> Always provide a 'Thought:' sequence, and a 'Code:
<span class="p">```</span><span class="nl">py' sequence ending with '```&lt;end_code&gt;' sequence, else you will fail.
</span><span class="sb">[...MORE RULES HERE]
</span></code></pre></div></div>

<h3 id="user">User</h3>

<p>The next message has <code class="language-plaintext highlighter-rouge">{"role": "user"}</code>, and it contains the task to perform:</p>

<blockquote>
  <p>New task:
How much is 2 * 21?</p>
</blockquote>

<h2 id="response">Response</h2>

<p>Remember that one of the rules in the system prompt says:</p>

<blockquote>
  <p>Always provide a ‘Thought:’ sequence, and a ‘Code:’ sequence</p>
</blockquote>

<p>Hence, the model proceeds to return <code class="language-plaintext highlighter-rouge">Thought:</code> and <code class="language-plaintext highlighter-rouge">Code:</code></p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: This is a simple multiplication task. I will multiply 2 by 21 and return the
result using the <span class="sb">`final_answer`</span> tool.

Code:

<span class="p">```</span><span class="nl">py
</span><span class="n">result</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">21</span>
<span class="n">final_answer</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="p">```</span>
</code></pre></div></div>

<p>The agent then runs the Python code, and since the <code class="language-plaintext highlighter-rouge">Code:</code> portion already uses
<code class="language-plaintext highlighter-rouge">final_answer</code>, it knows it has finished.</p>
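<p>One simple way to implement this completion check (a sketch, not necessarily <code class="language-plaintext highlighter-rouge">smolagents</code>’ actual mechanism) is to expose <code class="language-plaintext highlighter-rouge">final_answer</code> as a function that raises a sentinel exception when called:</p>

```python
class FinalAnswer(Exception):
    """Sentinel raised when the LLM-generated code calls final_answer."""
    def __init__(self, value):
        self.value = value

def final_answer(value):
    raise FinalAnswer(value)

def execute_step(code):
    """Run one LLM-generated code snippet; return (done, value_or_none)."""
    namespace = {"final_answer": final_answer}
    try:
        exec(code, namespace)
    except FinalAnswer as fa:
        return True, fa.value
    return False, None

print(execute_step("result = 2 * 21\nfinal_answer(result)"))  # (True, 42)
```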

<h2 id="output">Output</h2>

<p>Here’s the output that <code class="language-plaintext highlighter-rouge">agent.run</code> displays:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╭──────────── New run ────────────╮
│                                 │
│ How much is 2 * 21?             │
│                                 │
╰─ OpenAIServerModel - gpt-4o-mini╯
━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  result = 2 * 21
  final_answer(result)
 ──────────────────────────────────────
Out - Final answer: 42
[Step 0: Duration 1.08 seconds| Input tokens: 1,956 | Output tokens: 52]
</code></pre></div></div>

<h1 id="multi-step-example-fibonacci-series-no-tools">Multi-step example (fibonacci series, no tools)</h1>

<p>The previous task was trivial (no tools required, besides the <code class="language-plaintext highlighter-rouge">final_answer</code> tool). The
true potential of AI agents lies in performing complex tasks by using tools. Let’s look
at another agent whose plan involves two steps.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">smolagents</span> <span class="kn">import</span> <span class="n">CodeAgent</span><span class="p">,</span> <span class="n">OpenAIServerModel</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">OpenAIServerModel</span><span class="p">(</span>
    <span class="n">model_id</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
    <span class="n">api_base</span><span class="o">=</span><span class="s">"http://localhost:8080/v1"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">agent</span> <span class="o">=</span> <span class="n">CodeAgent</span><span class="p">(</span><span class="n">tools</span><span class="o">=</span><span class="p">[],</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">add_base_tools</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="n">agent</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
    <span class="s">"Could you give me the 118th number in the Fibonacci sequence?"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="prompt-1st-step">Prompt (1st step)</h2>

<h3 id="system-1">System</h3>

<p>The system prompt is the same as in the previous example.</p>

<h3 id="user-1">User</h3>

<blockquote>
  <p>New task:
Could you give me the 118th number in the Fibonacci sequence?</p>
</blockquote>

<h2 id="response-1st-step">Response (1st step)</h2>

<p>Unlike the first example, this response doesn’t use <code class="language-plaintext highlighter-rouge">final_answer</code>:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: To find the 118th number in the Fibonacci sequence, I can use a simple
iterative approach to compute Fibonacci numbers until the 118th index. This will be
efficient and easy to implement. I'll create two variables to hold the last two
Fibonacci numbers and iteratively compute the next Fibonacci number until I reach
the 118th one.

Code:
<span class="p">```</span><span class="nl">py
</span><span class="k">def</span> <span class="nf">fibonacci</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">b</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
    <span class="k">return</span> <span class="n">a</span>

<span class="n">fibonacci_118</span> <span class="o">=</span> <span class="n">fibonacci</span><span class="p">(</span><span class="mi">118</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fibonacci_118</span><span class="p">)</span>
<span class="p">```</span>
</code></pre></div></div>

<p>This example highlights one of the key challenges in agent systems: planning efficiency.
While a task can often be solved in multiple ways, some plans are more efficient than
others. Longer plans require more computational steps and time to execute, but current
models don’t necessarily optimize for the shortest possible solution.</p>

<p>In this case, the model chose a two-step approach when a single step would have
sufficed: it could have directly called <code class="language-plaintext highlighter-rouge">final_answer(fibonacci_118)</code> instead of
using <code class="language-plaintext highlighter-rouge">print(fibonacci_118)</code>.</p>

<p>Since the model didn’t use <code class="language-plaintext highlighter-rouge">final_answer</code> to conclude the task, we need to proceed
with an additional step.</p>
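<p>For comparison, a single-step plan would have combined both actions in one snippet. Here it is, runnable outside the agent with a stub standing in for the agent-provided <code class="language-plaintext highlighter-rouge">final_answer</code> tool:</p>

```python
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Stub of the agent-provided tool, so this runs outside the agent:
def final_answer(value):
    print("Final answer:", value)

final_answer(fibonacci(118))  # Final answer: 2046711111473984623691759
```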

<h2 id="prompt-2nd-step">Prompt (2nd step)</h2>

<h3 id="system-2">System</h3>

<p>The system prompt is the same as in the previous example.</p>

<h3 id="user-2">User</h3>

<blockquote>
  <p>New task:
Could you give me the 118th number in the Fibonacci sequence?</p>
</blockquote>

<h3 id="assistant">Assistant</h3>

<p>Unlike the first example where we had a single step, this time we have an assistant message.
This message includes the output from the previous step (<code class="language-plaintext highlighter-rouge">Thought:</code> and <code class="language-plaintext highlighter-rouge">Code:</code>) along
with information about the tools that were called (<code class="language-plaintext highlighter-rouge">Calling tools:</code>).</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: To find the 118th number in the Fibonacci sequence, I can use a simple iterative approach to compute Fibonacci
numbers until the 118th index. This will be efficient and easy to implement. I'll create two variables to hold the last
two Fibonacci numbers and iteratively compute the next Fibonacci number until I reach the 118th one.

Code:
<span class="p">```</span><span class="nl">py
</span><span class="k">def</span> <span class="nf">fibonacci</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">b</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
    <span class="k">return</span> <span class="n">a</span>

<span class="n">fibonacci_118</span> <span class="o">=</span> <span class="n">fibonacci</span><span class="p">(</span><span class="mi">118</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fibonacci_118</span><span class="p">)</span>
<span class="p">```</span>
Calling tools:
[{'id': 'call_1', 'type': 'function', 'function': {'name': 'python_interpreter', 'arguments': 'def fibonacci(n):<span class="se">\n</span>    a,
b = 0, 1<span class="se">\n</span>    for _ in range(n):<span class="se">\n</span>        a, b = b, a + b<span class="se">\n</span>    return a<span class="se">\n\n</span>fibonacci_118 =
fibonacci(118)<span class="se">\n</span>print(fibonacci_118)'}}]
</code></pre></div></div>

<h3 id="user-3">User</h3>

<p>To generate the next user message, the agent runs the Python code and sends the captured output back to the model as an observation:</p>

<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call id: call_1
Observation:
Execution logs:
2046711111473984623691759
Last output from code snippet:
None
</code></pre></div></div>

<h2 id="response-1">Response</h2>

<p>The response from the second API call shows that the model has already identified the
final output and produces a new code snippet that just uses <code class="language-plaintext highlighter-rouge">final_answer</code>, which
finishes the agent execution.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thought: The 118th number in the Fibonacci sequence is 2046711111473984623691759. Now I will provide this as the final
answer using the `final_answer` tool.

Code:
```py
final_answer(2046711111473984623691759)
```
</code></pre></div></div>

<h1 id="output-1">Output</h1>

<p>This is the output we see in the terminal.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>╭──────────── New run ────────────╮
│                                 │
│ Could you give me the 118th     │
│ number in the Fibonacci         │
│ sequence?                       │
│                                 │
╰─ OpenAIServerModel - gpt-4o-mini╯
━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  def fibonacci(n):
      a, b = 0, 1
      for _ in range(n):
          a, b = b, a + b
      return a

  fibonacci_118 = fibonacci(118)
  print(fibonacci_118)
 ──────────────────────────────────────
Execution logs:
2046711111473984623691759

Out: None
[Step 0: Duration 2.70 seconds| Input tokens: 1,961 | Output tokens: 127]
━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  final_answer(2046711111473984623691759)
 ──────────────────────────────────────
Out - Final answer: 2046711111473984623691759
[Step 1: Duration 1.55 seconds| Input tokens: 4,179 | Output tokens: 186]
</code></pre></div></div>

<h1 id="final-thoughts">Final thoughts</h1>

<p>The concept of AI agents is rapidly evolving. Encouragingly, a consensus is emerging
around their core concepts: agents plan and use tools to accomplish tasks.
However, research in this field is still a work in progress. As new research emerges
and more powerful models are developed, many existing frameworks will likely become
outdated. This is why I believe it’s crucial to understand what’s happening behind the
scenes; specifically, what API calls are being made to the LLM. This understanding
allows us to grasp their strengths and weaknesses, customize their behavior, or
even develop our own solutions when existing options don’t meet our needs.</p>

<p>A significant limitation of current frameworks is their reliance on hardcoded prompts,
as there’s no guarantee these prompts will perform optimally for specific tasks.
I predict that future agent frameworks will evolve into meta-frameworks, offering
greater flexibility to customize prompts and choose between different planning
strategies (such as defining a complete plan upfront versus incrementally adding steps
until reaching a stopping condition, like <code class="language-plaintext highlighter-rouge">smolagents</code> does).</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Exporting Shiny apps with Shinylive</title><link href="https://blancas.io/blog/shinylive/" rel="alternate" type="text/html" title="Exporting Shiny apps with Shinylive" /><published>2024-11-22T00:00:00+00:00</published><updated>2024-11-22T00:00:00+00:00</updated><id>https://blancas.io/blog/shinylive</id><content type="html" xml:base="https://blancas.io/blog/shinylive/"><![CDATA[<p>I was playing around with <a href="https://posit-dev.github.io/r-shinylive/">Shinylive</a> but
encountered some issues when using it; here’s the code that worked. It uses Docker
to make it reproducible.</p>

<p>Copy all the following files in the same directory:</p>

<h2 id="appappr"><code class="language-plaintext highlighter-rouge">app/app.R</code></h2>

<p>A hello-world app for testing. Note that this file lives under the <code class="language-plaintext highlighter-rouge">app/</code> directory; you must
create it inside an <code class="language-plaintext highlighter-rouge">app/</code> directory.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">shiny</span><span class="p">)</span><span class="w">

</span><span class="c1"># Define UI</span><span class="w">
</span><span class="n">ui</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fluidPage</span><span class="p">(</span><span class="w">
    </span><span class="n">titlePanel</span><span class="p">(</span><span class="s2">"Hello World Shiny App"</span><span class="p">),</span><span class="w">
    
    </span><span class="n">sidebarLayout</span><span class="p">(</span><span class="w">
        </span><span class="n">sidebarPanel</span><span class="p">(</span><span class="w">
            </span><span class="n">textInput</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Enter your name:"</span><span class="p">,</span><span class="w"> </span><span class="s2">"World"</span><span class="p">)</span><span class="w">
        </span><span class="p">),</span><span class="w">
        
        </span><span class="n">mainPanel</span><span class="p">(</span><span class="w">
            </span><span class="n">h3</span><span class="p">(</span><span class="s2">"Greeting:"</span><span class="p">),</span><span class="w">
            </span><span class="n">textOutput</span><span class="p">(</span><span class="s2">"greeting"</span><span class="p">)</span><span class="w">
        </span><span class="p">)</span><span class="w">
    </span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1"># Define server logic</span><span class="w">
</span><span class="n">server</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">output</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">output</span><span class="o">$</span><span class="n">greeting</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">renderText</span><span class="p">({</span><span class="w">
        </span><span class="n">paste</span><span class="p">(</span><span class="s2">"Hello,"</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="o">$</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s2">"!"</span><span class="p">)</span><span class="w">
    </span><span class="p">})</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># Run the application</span><span class="w">
</span><span class="n">shinyApp</span><span class="p">(</span><span class="n">ui</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ui</span><span class="p">,</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">server</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="convertr"><code class="language-plaintext highlighter-rouge">convert.R</code></h2>

<p>Simple script to convert the Shiny app to Shinylive.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shinylive</span><span class="o">::</span><span class="n">export</span><span class="p">(</span><span class="s2">"/app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"site"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></h2>

<div class="language-Dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> r-base</span>

<span class="c"># install OS requirements</span>
<span class="k">RUN </span>apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="se">\
</span>    libarchive-dev libssl-dev libcurl4-openssl-dev <span class="se">\
</span>    <span class="o">&amp;&amp;</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span>

<span class="c"># install R requirements</span>
<span class="k">COPY</span><span class="s"> install.R /_shinylive/install.R</span>
<span class="k">RUN </span>Rscript /_shinylive/install.R

<span class="c"># copy the converter script</span>
<span class="k">COPY</span><span class="s"> convert.R /_shinylive/convert.R</span>

<span class="k">WORKDIR</span><span class="s"> /_shinylive</span>

<span class="k">ENTRYPOINT</span><span class="s"> ["Rscript", "/_shinylive/convert.R"]</span>
</code></pre></div></div>

<h2 id="exporting-to-shinylive">Exporting to Shinylive</h2>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># build the docker image</span>
docker build <span class="nt">-t</span> shinylive <span class="nb">.</span>

<span class="c"># export app.R</span>
docker run <span class="nt">-v</span> <span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>:/app shinylive
</code></pre></div></div>

<p>Once the <code class="language-plaintext highlighter-rouge">docker run</code> command finishes, you’ll see a <code class="language-plaintext highlighter-rouge">site/</code> directory.</p>

<p>Then, you can run the exported app with any HTTP server. If you have R, you can use <code class="language-plaintext highlighter-rouge">httpuv</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install httpuv</span>
Rscript <span class="nt">-e</span> <span class="s1">'install.packages("httpuv", repos="https://cran.rstudio.com")'</span>

<span class="c"># run httpuv - and open the printed URL</span>
Rscript <span class="nt">-e</span> <span class="s1">'httpuv::runStaticServer("site/")'</span>
</code></pre></div></div>

<p>If you have a Python installation:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># open: http://localhost:8000</span>
python <span class="nt">-m</span> http.server 8000 <span class="nt">--directory</span> site
</code></pre></div></div>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[I was playing around with Shinylive but encountered some issues when using it; here’s the code that worked. It uses Docker to make it reproducible.]]></summary></entry><entry><title type="html">Minifying HTML for GPT-4o: Remove all the HTML Tags</title><link href="https://blancas.io/blog/html-minify-for-llm/" rel="alternate" type="text/html" title="Minifying HTML for GPT-4o: Remove all the HTML Tags" /><published>2024-09-02T00:00:00+00:00</published><updated>2024-09-02T00:00:00+00:00</updated><id>https://blancas.io/blog/html-minify-for-llm</id><content type="html" xml:base="https://blancas.io/blog/html-minify-for-llm/"><![CDATA[<p>tl;dr: if you want to pass HTML data to GPT-4o, just strip out all the HTML and pass raw text; it’s cheaper and there is little to no performance degradation. <a href="#final-comments">Source code and demo available.</a></p>

<p><img src="/assets/images/html-minify-for-llm/cost_vs_accuracy_comparison.png" alt="" /></p>

<p>Following up my <a href="/blog/ai-web-scraper">earlier post</a> on using GPT-4o for web scraping (and finding out how expensive it is!), I wanted to investigate approaches to lower the cost.</p>

<p>My hypothesis was that the document’s structure would matter when extracting structured data and that I’d see a significant cost vs. accuracy trade-off: by stripping structure out of the HTML document, I expected a noticeable drop in accuracy. This turned out to be false: <em>GPT-4o doesn’t need any HTML structure to correctly extract data.</em></p>

<p>I used the <a href="https://en.wikipedia.org/wiki/Mercury_Prize">Mercury Prize</a> Wikipedia page as input data; the page is a reasonable size and contains a long table with multiple entities (years, artists, albums, nominees), but most importantly, it’s a fun dataset to work with.</p>

<h2 id="experimental-setup-questions">Experimental setup: questions</h2>

<p>Since I wanted to test to what extent HTML structure affects extraction quality, I asked GPT-4o two types of questions:</p>

<ol>
  <li><strong>Unstructured:</strong> the information needed to answer is in the document’s paragraphs and the answer is a string</li>
  <li><strong>Structured:</strong> the information needed to answer is in the table and the answer is structured (a list of strings)</li>
</ol>

<p>I asked 20 questions in total, 10 unstructured and 10 structured.</p>

<p>I varied the complexity of the questions. For the unstructured case, the range was pretty limited since there isn’t much wiggle room, and I didn’t want to ask questions that involved math (evaluating math capabilities is not the purpose of this experiment).</p>

<p>However, the structured case gave me more space to experiment. Here are some sample
questions:</p>

<blockquote>
  <p>Give me the years for the 1st, 4th and 8th editions (in order)</p>
</blockquote>

<p>The answer involves understanding the table structure and order.</p>

<blockquote>
  <p>Extract the shortlisted nominees (include the winner) for the 25th edition, only the artist names (they appear first, followed by the album)</p>
</blockquote>

<p>Answering this also requires understanding the structure: the model first has to find the row for the 25th edition, then extract data from two columns (winner and nominees are in separate columns), and finally split data that appears in the same column (<code class="language-plaintext highlighter-rouge">Artist - Album</code>):</p>

<p><img src="/assets/images/html-minify-for-llm/25th-edition.png" alt="25th-edition" /></p>

<p>You can see all the questions in the <a href="https://github.com/edublancas/posts/tree/main/html-minify-for-llm">source code.</a></p>

<h2 id="experimental-setup-pre-processing">Experimental setup: pre-processing</h2>

<p>Next, I developed a couple of text pre-processing pipelines that transform the HTML document; the objective is to reduce the number of tokens and thus lower the cost (OpenAI charges per token). I tried the following pipelines:</p>

<ol>
  <li>No processing: the HTML document is passed as-is to the model (the most expensive approach!)</li>
  <li>Clean HTML: excludes everything outside the <code class="language-plaintext highlighter-rouge">&lt;body&gt;&lt;/body&gt;</code> tags, removes all attributes from HTML tags (except <code class="language-plaintext highlighter-rouge">class</code>, <code class="language-plaintext highlighter-rouge">id</code>, and <code class="language-plaintext highlighter-rouge">data-testid</code>), replaces <code class="language-plaintext highlighter-rouge">class</code> and <code class="language-plaintext highlighter-rouge">id</code> with increasing numbers (1, 2, 3, etc.), cleans up whitespace, and replaces <code class="language-plaintext highlighter-rouge">&lt;a&gt;TEXT&lt;/a&gt;</code> with <code class="language-plaintext highlighter-rouge">TEXT</code></li>
  <li>HTML remover: completely removes all HTML and only keeps the text</li>
  <li>Converts the HTML into <a href="https://github.com/matthewwithanm/python-markdownify">markdown</a> (I added this because some people recommended it on X/Twitter - LLMs are trained on a lot of markdown, hence, they’re expected to understand its structure)</li>
</ol>
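<p>As an illustration, pipeline 3 (the HTML remover) can be sketched with Python’s standard library; this is a simplified stand-in, not the exact implementation from the source code linked in this post:</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and discards every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def remove_html(html: str) -> str:
    """Pipeline 3: keep only the text, dropping all HTML structure."""
    parser = TextExtractor()
    parser.feed(html)
    # collapse the whitespace left behind by the removed tags
    return " ".join(" ".join(parser.chunks).split())
```

<p>Since only the visible text survives, the token count (and the bill) drops substantially.</p>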

<h2 id="experimental-setup-prompts">Experimental setup: prompts</h2>

<p>Here are the functions that I used to call GPT-4o and GPT-4o mini.</p>

<h3 id="unstructured">Unstructured</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">answer_question</span><span class="p">(</span><span class="o">*</span><span class="p">,</span> <span class="n">html_content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="s">"""
You're an expert question-answering system. You're given a snippet of HTML content
and a question. You need to answer the question based on the HTML content. Your response should be a plain text answer to the question based on the HTML content. Your
answer should be concise and to the point.
    """</span>

    <span class="n">completion</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">.</span><span class="n">strip</span><span class="p">(),</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"HTML Content: </span><span class="si">{</span><span class="n">html_content</span><span class="si">}</span><span class="se">\n\n</span><span class="s">Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">],</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">completion</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>

</code></pre></div></div>

<h3 id="structured">Structured</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ParsedColumn</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">parse_column</span><span class="p">(</span><span class="o">*</span><span class="p">,</span> <span class="n">html_content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="s">"""
You're an expert web scraper. You're given the HTML contents of a table, a user
query and you have to extract a column from it that is related to the user query.

The name of the column should be the header of the column. The values should be the
text content of the cells in the column.
    """</span>

    <span class="n">completion</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"HTML Content: </span><span class="si">{</span><span class="n">html_content</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"User Query: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">],</span>
        <span class="n">response_format</span><span class="o">=</span><span class="n">ParsedColumn</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">event</span> <span class="o">=</span> <span class="n">completion</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">parsed</span>
    <span class="k">return</span> <span class="n">event</span><span class="p">.</span><span class="n">model_dump</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="model-evaluation">Model evaluation</h2>

<p>I considered answers to the <strong>Unstructured</strong> questions to be correct when they contained the expected answer. Let’s see a sample question:</p>

<blockquote>
  <p>Which artist has been nominated the most times for the Mercury Prize without winning?</p>
</blockquote>

<p>Any of these is considered a correct answer:</p>

<ol>
  <li>Radiohead</li>
  <li>Radiohead is the band that has been nominated the most with no wins</li>
  <li>The answer is Radiohead</li>
</ol>
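<p>The check itself is a simple containment test; here’s a sketch (my own simplification, not the exact evaluation code):</p>

```python
def is_correct_unstructured(model_answer: str, expected: str) -> bool:
    """An unstructured answer counts as correct if it mentions the
    expected string; case is ignored so "radiohead" also passes."""
    return expected.lower() in model_answer.lower()


assert is_correct_unstructured("The answer is Radiohead", "Radiohead")
assert not is_correct_unstructured("Oasis", "Radiohead")
```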

<p>In the <strong>Structured</strong> questions there were two cases: in some, the answer’s order did not matter, but in others, it did.</p>

<p>Here’s a sample question whose answer didn’t require ordering:</p>

<blockquote>
  <p>Extract the shortlisted artists (exclude the winner) for 2015. Only artist names (artists appear first, followed by the album)</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">answer</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"Aphex Twin"</span><span class="p">,</span>
        <span class="s">"Gaz Coombes"</span><span class="p">,</span>
        <span class="s">"C Duncan"</span><span class="p">,</span>
        <span class="s">"Eska"</span><span class="p">,</span>
        <span class="s">"Florence and the Machine"</span><span class="p">,</span>
        <span class="s">"Ghostpoet"</span><span class="p">,</span>
        <span class="s">"Róisín Murphy"</span><span class="p">,</span>
        <span class="s">"Slaves"</span><span class="p">,</span>
        <span class="s">"Soak"</span><span class="p">,</span>
        <span class="s">"Wolf Alice"</span><span class="p">,</span>
        <span class="s">"Jamie xx"</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">assert</span> <span class="nb">set</span><span class="p">(</span><span class="n">answer_gpt4</span><span class="p">)</span> <span class="o">==</span> <span class="n">answer</span>
</code></pre></div></div>

<p>Here’s an example of a question whose answer required ordering:</p>

<blockquote>
  <p>Extract the winners of the Mercury Prize from 1992 to 1995, in order</p>
</blockquote>

<p>Expected answer:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">answer</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"Primal Scream – Screamadelica"</span><span class="p">,</span>
    <span class="s">"Suede – Suede"</span><span class="p">,</span>
    <span class="s">"M People – Elegant Slumming"</span><span class="p">,</span>
    <span class="s">"Portishead – Dummy"</span><span class="p">,</span>
<span class="p">]</span>

<span class="k">assert</span> <span class="n">answer_gpt4</span> <span class="o">==</span> <span class="n">answer</span>
</code></pre></div></div>

<h2 id="results">Results</h2>

<p><img src="/assets/images/html-minify-for-llm/cost_vs_accuracy_comparison.png" alt="" /></p>

<p><img src="/assets/images/html-minify-for-llm/accuracy_comparison.png" alt="" /></p>

<h3 id="unstructured-1">Unstructured</h3>

<p>When asking <strong>unstructured</strong> questions, GPT-4o and its mini version have similar performance, and the pre-processing doesn’t make a difference. Since the price gap is big, <strong>I recommend using GPT-4o mini for unstructured questions with all the HTML removed to maximize savings.</strong></p>

<h3 id="structured-1">Structured</h3>

<p><strong>Structured</strong> questions paint a fairly different picture: GPT-4o performs considerably better than the mini version. However, pre-processing has little to no effect on accuracy. Given the price difference between models, <strong>I recommend testing both with a sample of your data and deciding whether the accuracy gains justify the steep price increase. In both cases, you can remove all the HTML tags to reduce the price.</strong></p>

<h3 id="raw-results">Raw results</h3>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>input</th>
      <th>cost</th>
      <th>accuracy</th>
      <th>question_type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>gpt-4o-mini</td>
      <td>raw</td>
      <td>0.163094</td>
      <td>0.8</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>clean</td>
      <td>0.052281</td>
      <td>0.8</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>unstructured</td>
      <td>0.017891</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>markdown</td>
      <td>0.066414</td>
      <td>0.8</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>raw</td>
      <td>0.049740</td>
      <td>0.5</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>clean</td>
      <td>0.014858</td>
      <td>0.3</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>unstructured</td>
      <td>0.004851</td>
      <td>0.4</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-mini</td>
      <td>markdown</td>
      <td>0.027072</td>
      <td>0.1</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>raw</td>
      <td>2.718225</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>clean</td>
      <td>0.871350</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>unstructured</td>
      <td>0.298175</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>markdown</td>
      <td>1.106900</td>
      <td>0.9</td>
      <td>unstructured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>raw</td>
      <td>0.829000</td>
      <td>0.8</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>clean</td>
      <td>0.247625</td>
      <td>0.7</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>unstructured</td>
      <td>0.080850</td>
      <td>0.7</td>
      <td>structured</td>
    </tr>
    <tr>
      <td>gpt-4o-2024-08-06</td>
      <td>markdown</td>
      <td>0.451200</td>
      <td>0.7</td>
      <td>structured</td>
    </tr>
  </tbody>
</table>

<h2 id="final-comments">Final comments</h2>

<p>Until GPT-4o becomes cheaper, data extraction tasks require careful evaluation
to avoid breaking the bank. You might be just fine with GPT-4o mini in some cases,
but GPT-4o’s performance is much better in others, so evaluate for your use case.</p>

<p>Models have inherent randomness, but I didn’t include accuracy ranges in the results
as that’d involve a higher OpenAI bill (check out my <a href="https://ploomber.io/">startup</a>;
if you become a customer, I’ll be able to justify a higher budget for these
experiments!). But I doubt that repeating the experiments would flip the conclusions.</p>

<p>If you want to run the benchmark, here’s the <a href="https://github.com/edublancas/posts/tree/main/html-minify-for-llm">source code</a>. If
you want to play with the pre-processing pipelines, try this demo app: <a href="https://orange-sea-7185.ploomberapp.io">https://orange-sea-7185.ploomberapp.io</a>. It lets you
enter a URL and estimate the savings.</p>

<p>If you have questions, ping me on <a href="https://x.com/edublancas">X</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[tl;dr; if you want to pass HTML data to GPT-4o, just strip out all the HTML and pass raw text, it’s cheaper and there is little to no performance degradation. Source code and demo available.]]></summary></entry><entry><title type="html">Using GPT-4o for web scraping</title><link href="https://blancas.io/blog/ai-web-scraper/" rel="alternate" type="text/html" title="Using GPT-4o for web scraping" /><published>2024-08-28T00:00:00+00:00</published><updated>2024-08-28T00:00:00+00:00</updated><id>https://blancas.io/blog/ai-web-scraper</id><content type="html" xml:base="https://blancas.io/blog/ai-web-scraper/"><![CDATA[<p>tl;dr; show me the <a href="#conclusions-and-demo">demo and source code!</a></p>

<p><img src="/assets/images/ai-web-scraper/app.png" alt="app" /></p>

<p>I’m pretty excited about the new <a href="https://platform.openai.com/docs/guides/structured-outputs">structured outputs</a>
feature in OpenAI’s API so I took it for a spin and developed an AI-assisted web scraper. This post summarizes my learnings.</p>

<h2 id="asking-gpt-4o-to-scrape-data">Asking GPT-4o to scrape data</h2>

<p>The first experiment was to directly ask GPT-4o to extract the data from an HTML
string, so I used the new structured outputs feature with the following <a href="https://docs.pydantic.dev/latest/">Pydantic</a> models:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span>

<span class="k">class</span> <span class="nc">ParsedColumn</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">values</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>


<span class="k">class</span> <span class="nc">ParsedTable</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">columns</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ParsedColumn</span><span class="p">]</span>
</code></pre></div></div>

<p>The system prompt is:</p>

<blockquote>
  <p>You’re an expert web scraper. You’re given the HTML contents of a table and you have to extract structured data from it.</p>
</blockquote>

<p>Here are some interesting things I found when parsing different tables.</p>

<p><em>Note:</em> I also tried GPT-4o mini, but it yielded significantly worse results, so I continued my experiments with GPT-4o.</p>

<h2 id="parsing-complex-tables">Parsing complex tables</h2>

<p><img src="/assets/images/ai-web-scraper/image.png" alt="alt text" /></p>

<p>After experimenting with some simple tables, I wanted to see how the model would do with more complex ones, so I passed a 10-day <a href="https://weather.com">weather</a> forecast from Weather.com. The table
contains a big row for the first day at the top and smaller rows for the other 9
days. Interestingly, GPT-4o was able to parse this correctly:</p>

<p><img src="/assets/images/ai-web-scraper/image-1.png" alt="alt text" /></p>

<p>For the 9 remaining days, the table shows a day and a night forecast (see screenshot above). The model correctly parsed such data and added a <code class="language-plaintext highlighter-rouge">Day/Night</code> column. Here’s how it looks in the browser (note that to display this, we need to click on the button to the right of each row):</p>

<p><img src="/assets/images/ai-web-scraper/image-2.png" alt="alt text" /></p>

<p>At first, I thought the parsed <code class="language-plaintext highlighter-rouge">Condition</code> column was a hallucination since I did not see it on the website. However, upon inspecting the source code, I realized that those tags exist but are invisible in the table.</p>

<h2 id="combined-rows-break-the-model">Combined rows break the model</h2>

<p>When thinking about where to find <em>easy tables</em>, my first thought was <em>Wikipedia</em>. It turns out that a <em>simple</em> table from Wikipedia (<a href="https://en.wikipedia.org/wiki/Human_Development_Index">Human Development Index</a>) breaks the model because rows with repeated values are merged:</p>

<p><img src="/assets/images/ai-web-scraper/image-3.png" alt="alt text" /></p>

<p>And while the model is able to retrieve individual columns (as instructed by the system prompt), they don’t all have the same length; hence, I’m unable to represent the data as a table.</p>

<p>I tried modifying the system prompt with the following:</p>

<blockquote>
  <p>Tables might collapse rows into a single row. If that’s the case, extract the collapsed row as multiple JSON values to ensure all columns contain the same number of rows.</p>
</blockquote>

<p>But it didn’t work. I have yet to try modifying the system prompt
to tell the model to extract rows instead of columns.</p>

<h2 id="asking-gpt-4o-to-return-xpaths">Asking GPT-4o to return XPaths</h2>

<p>Running an OpenAI API call every time can become very expensive, so I figured I’d ask the model to return <a href="https://developer.mozilla.org/en-US/docs/Web/XPath">XPaths</a> instead of
the parsed data. This would allow me to scrape the same page (e.g., to fetch updated data) without breaking the bank.</p>

<p>After some tweaks, I came up with this prompt:</p>

<blockquote>
  <p>You’re an expert web scraper.</p>

  <p>The user will provide the HTML content and the column name.
Your job is to come up with an XPath that will return all elements of that column.</p>

  <p>The XPath should be a string that can be evaluated by Selenium’s
<code class="language-plaintext highlighter-rouge">driver.find_elements(By.XPATH, xpath)</code> method.</p>

  <p>Return the full matching element, not just the text.</p>
</blockquote>

<p>Unfortunately, this didn’t work well. Sometimes, the model would return invalid XPaths (although
this was alleviated with the sentence that mentions Selenium) or XPaths that would
return incorrect data or no data at all.</p>

<h2 id="combining-the-two-approaches">Combining the two approaches</h2>

<p>My next attempt was to combine both approaches: once the model extracted the data,
we could use it as a reference to ask the model for the XPath. <em>This worked much better than straight asking for XPaths!</em></p>

<p>I noticed that sometimes the generated XPath would return no data at all so I added
some dumb retry logic: if the XPath returns no results, try again. This did the trick for
the tables I tested.</p>
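<p>The retry loop can be sketched like this; <code class="language-plaintext highlighter-rouge">generate_xpath</code> and <code class="language-plaintext highlighter-rouge">evaluate_xpath</code> are hypothetical stand-ins for the model call and the Selenium lookup:</p>

```python
def xpath_with_retry(generate_xpath, evaluate_xpath, max_attempts=3):
    """Ask the model for an XPath; if evaluating it returns no
    elements, ask again (up to max_attempts times)."""
    for _ in range(max_attempts):
        xpath = generate_xpath()  # hypothetical: one OpenAI call
        elements = evaluate_xpath(xpath)  # hypothetical: driver.find_elements
        if elements:
            return xpath, elements
    return None, []
```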

<p>However, I noticed a new issue: sometimes the first step (extract data) converted images into text (e.g., an arrow pointing upwards might appear in the
extracted data as “arrow-upwards”). This caused the second step to fail, since it’d look for data that wasn’t there. I did not attempt to fix this problem.</p>

<h2 id="gpt-4o-is-very-expensive">GPT-4o is very expensive</h2>

<p><img src="/assets/images/ai-web-scraper/image-4.png" alt="alt text" /></p>

<p>Scraping with GPT-4o can become very expensive since even small HTML tables can contain lots of characters. I’ve been experimenting for two days and I’ve already spent $24!</p>

<p>To reduce the cost, I added some cleanup logic to remove unnecessary data from the HTML string before passing it to the model. A simple function that removes all attributes except <code class="language-plaintext highlighter-rouge">class</code>, <code class="language-plaintext highlighter-rouge">id</code>, and <code class="language-plaintext highlighter-rouge">data-testid</code> (which are the ones I noticed the generated XPaths were using) trimmed the number of characters in the table by half.</p>
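<p>Such a cleanup function can be sketched with the standard library; this is a simplified version I wrote for illustration, not the code from the repo linked in this post:</p>

```python
from html.parser import HTMLParser

KEEP = {"class", "id", "data-testid"}


class AttributeStripper(HTMLParser):
    """Re-emits the HTML, keeping only the attributes the
    generated XPaths rely on."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in KEEP and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def strip_attributes(html: str) -> str:
    parser = AttributeStripper()
    parser.feed(html)
    return "".join(parser.out)
```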

<p>I didn’t see any performance degradation, and my suspicion is that this cleanup might actually improve extraction quality.</p>

<p>Currently, the second step (generating XPaths) makes one model call per column in
the table; another improvement could be to generate more than one XPath per call. I have yet
to try this approach and evaluate its performance.</p>

<h2 id="conclusions-and-demo">Conclusions and demo</h2>

<p>I was surprised by the extraction quality of GPT-4o (but then sadly surprised when I looked at how much I’d have to pay OpenAI!). Nonetheless, this was a fun experiment and I definitely see potential for AI-assisted web scraping tools.</p>

<p>I did a quick demo using Streamlit, you can check it out here: <a href="https://orange-resonance-9766.ploomberapp.io">https://orange-resonance-9766.ploomberapp.io</a>, the source code is on <a href="https://github.com/edublancas/posts/tree/main/ai-web-scraping">GitHub</a> (Spoiler: don’t expect anything polished).</p>

<p>I wanted to test more tables; however, since that’d involve a higher OpenAI bill, I only tried a handful of them. (Check out my <a href="https://ploomber.io/">startup</a>;
if you become a customer, I’ll be able to justify a higher budget for these
experiments!)</p>

<p>Some stuff I’d like to try if I had more time:</p>

<ol>
  <li>Capture browser events: the current demo is a one-off process where users enter the URL and an initial XPath. This isn’t great UX; it’d be better to ask the user to click on the table they want to extract and to provide some sample rows so the model can understand the structure a bit better.</li>
  <li>In complex tables, a single XPath might not be enough to extract a full column, I’d like to see if asking the LLM to return a program (e.g. Python) would work.</li>
  <li>More experimenting with the HTML cleanup is needed. It’s very expensive to use GPT-4o, and I feel like I’m passing a lot of unnecessary data to the model.</li>
</ol>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[tl;dr; show me the demo and source code!]]></summary></entry><entry><title type="html">Don’t make users read your docs</title><link href="https://blancas.io/blog/users-and-docs/" rel="alternate" type="text/html" title="Don’t make users read your docs" /><published>2022-07-23T00:00:00+00:00</published><updated>2022-07-23T00:00:00+00:00</updated><id>https://blancas.io/blog/users-and-docs</id><content type="html" xml:base="https://blancas.io/blog/users-and-docs/"><![CDATA[<p>As an <a href="https://github.com/ploomber/ploomber">open-source maintainer</a>, I always put effort into documenting all known edge cases so that users know how to fix problems. So, whenever users report incompatibilities, we highlight them in our documentation. Still, I realized this approach wasn’t working when users came to our Slack asking for help with problems we had already documented.</p>

<p>As project maintainers, we tend to be overly optimistic about how good the documentation is. But the target metric should not be how detailed our documentation is but how fast users can get things done. And when things go wrong, reading the documentation is not always the quickest route, so <em>don’t make your users read your docs, help them right on the spot.</em></p>

<h2 id="motivating-example">Motivating example</h2>

<p>A few weeks ago, a user <a href="https://github.com/ploomber/ploomber/issues/882">reported an issue</a>. The details are not important, but it required us to add a new argument to a class. We added the argument to the constructor, documented it, and posted the solution in the GitHub issue; however, when thinking about what would happen if a new user hit the same issue, I realized we had solved the problem for one user but not for the rest. Most likely, other users would have a hard time trying to fix the issue, and they’d give up if they didn’t find the answer quickly.</p>

<h2 id="useful-error-messages">Useful error messages</h2>

<p>A helpful error message tells you three things:</p>

<ol>
  <li>What failed</li>
  <li>Why it failed</li>
  <li>How to fix it</li>
</ol>

<p>For example:</p>

<blockquote>
  <p>RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the ‘spawn’ start method</p>
</blockquote>

<p>This error message contains the three elements:</p>

<ol>
  <li>Cannot re-initialize CUDA [What failed]</li>
  <li>…in forked process [Why it failed]</li>
  <li>Use the ‘spawn’ start method [How to fix it]</li>
</ol>

<p>The problem is that our framework builds an abstraction, so users don’t have to use the <code class="language-plaintext highlighter-rouge">multiprocessing</code> module directly; hence, the user couldn’t fix the issue unless they modified the source code.</p>

<p>In our specific use case, here’s a better error message:</p>

<blockquote>
  <p>RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, pass ‘spawn’ to the ‘start_method’ argument of the Parallel executor constructor</p>
</blockquote>

<p>Let’s see how to achieve this.</p>

<h2 id="helpful-error-messages">Helpful error messages</h2>

<p><em>Note: the following sections contain Python code snippets, but the idea applies to any language.</em></p>

<p>We want to anticipate the error and tell the user how to get things running:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">some_package.exceptions</span> <span class="kn">import</span> <span class="n">SomeException</span>

<span class="k">def</span> <span class="nf">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>
    <span class="p">...</span>


<span class="k">def</span> <span class="nf">thing_that_the_user_calls</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="o">=</span><span class="n">argument</span><span class="p">)</span>
    <span class="k">except</span> <span class="n">SomeException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="c1"># add more context and raise whatever exception type makes sense
</span>        <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">'How to fix it'</span><span class="p">)</span> <span class="k">from</span> <span class="n">e</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="c1"># raise the original exception, unmodified
</span>        <span class="k">raise</span>
    <span class="p">...</span>
</code></pre></div></div>

<p><em>Note:</em> the <code class="language-plaintext highlighter-rouge">raise exception from another_exception</code> expression is called a <a href="https://peps.python.org/pep-3134/">chained exception</a> in Python.</p>

<p>The previous snippet will show the user specific instructions when they encounter the problem while using our software.</p>

<p>However, we’re assuming that:</p>

<ol>
  <li>We can import <code class="language-plaintext highlighter-rouge">some_package.exceptions</code> in our project’s codebase (which implies adding it as a dependency)</li>
  <li>We are sure that when <code class="language-plaintext highlighter-rouge">SomeException</code> is raised, the solution is what we are displaying to the user</li>
</ol>

<p>Sometimes exceptions are too general, so we need to dig deeper. In such cases, we can use the error message as a proxy:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>
    <span class="p">...</span>


<span class="k">def</span> <span class="nf">thing_that_the_user_calls</span><span class="p">(</span><span class="n">argument</span><span class="p">):</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">thing_that_breaks</span><span class="p">(</span><span class="n">argument</span><span class="o">=</span><span class="n">argument</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">if</span> <span class="s">'some hint'</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="s">'Instructions on how to fix it'</span><span class="p">)</span> <span class="k">from</span> <span class="n">e</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>There are obvious drawbacks to this approach: the error message might change. However, the same is true for the exception type, so in either case, make sure you have unit tests in place.</p>
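<p>Here is a minimal, self-contained sketch of such a test (it reuses the hypothetical function names from the snippets above and stubs the failing call; in a real project you would write this against the actual dependency, probably with <code class="language-plaintext highlighter-rouge">pytest</code>):</p>

```python
def thing_that_breaks(argument):
    # stand-in for the internal call that raises a vague, low-level error
    raise ValueError('some hint: low-level failure')


def thing_that_the_user_calls(argument):
    try:
        thing_that_breaks(argument=argument)
    except Exception as e:
        if 'some hint' in str(e):
            raise RuntimeError('Instructions on how to fix it') from e
        raise


def test_error_message_is_helpful():
    try:
        thing_that_the_user_calls(argument=None)
    except RuntimeError as e:
        # the user-facing message carries the fix...
        assert 'how to fix it' in str(e)
        # ...and exception chaining preserves the original error
        assert isinstance(e.__cause__, ValueError)
    else:
        raise AssertionError('expected a RuntimeError')


test_error_message_is_helpful()
```

<p>If the underlying error message ever changes, a test like this (written against the real dependency instead of the stub) will fail and alert us to update the check.</p>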

<p>I’ve encountered cases where checking the error message isn’t enough, and we might display inaccurate instructions. In such situations, I write the error message to reflect that uncertainty:</p>

<blockquote>
  <p>If having issues with X, try [possible solution]</p>
</blockquote>

<h2 id="the-end">The end</h2>

<p>If you enjoyed this, let’s connect on <a href="https://twitter.com/edublancas">Twitter</a>, where I often post my adventures as an open-source maintainer, and if you do Data Science, check out our <a href="https://github.com/ploomber/ploomber">project</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[As an open-source maintainer, I always put effort into documenting all known edge cases so that users know how to fix problems. So, whenever users report incompatibilities, we highlight them in our documentation. Still, I realized this approach wasn’t working when users came to our Slack asking for help with problems we had already documented.]]></summary></entry><entry><title type="html">5 signs your Data Science workflow is broken</title><link href="https://blancas.io/blog/ds-broken-workflow/" rel="alternate" type="text/html" title="5 signs your Data Science workflow is broken" /><published>2019-07-16T00:00:00+00:00</published><updated>2019-07-16T00:00:00+00:00</updated><id>https://blancas.io/blog/ds-broken-workflow</id><content type="html" xml:base="https://blancas.io/blog/ds-broken-workflow/"><![CDATA[<p>Developing reproducible data pipelines is hard, but before we even think about reproducibility, your project has to meet some minimum standards. This post discusses some recurring bad practices when developing data pipelines and provides some advice to overcome them.</p>

<h2 id="1-lack-of-setup-instructions">1. Lack of setup instructions</h2>

<p>The first step in every software project is to get the environment up and running (e.g., install UNIX package A, then install Python 3.7, then install Python libraries X, Y, and Z); however, more often than not, the environment is set up once and the instructions are never recorded.</p>

<p>Data Science projects often depend on complex software setups (e.g., installing GPU or database drivers); lack of instructions will surely cause a lot of trouble for the team, especially when a new member joins or when the project is taken to a production environment.</p>

<p>These setup instructions have to be kept up to date at all times: they will break if a single new dependency is not registered, and they become unnecessarily complex if a dependency that is no longer needed stays listed.</p>

<p><strong>How to fix it?</strong> Every project should come with a shell script that sets it up. Package managers do the heavy lifting of installing software, so you can assume that one is already installed.</p>

<p>To prevent setup instructions from becoming outdated, test them every time your code changes by using a <a href="https://en.wikipedia.org/wiki/Continuous_integration">Continuous Integration</a> service such as <a href="https://travis-ci.org/">Travis CI</a>. While CI services can detect when your dependencies no longer work, they cannot detect unnecessary libraries; those you have to remove manually from the setup script.</p>

<h2 id="2-environment-configuration-embedded-in-the-source-code">2. Environment configuration embedded in the source code</h2>

<p>If you keep seeing this error message when running your pipeline:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s2">"/Users/coworkersname/data/clean_v2.parquet"</span> file not found.
</code></pre></div></div>

<p>It is probably because someone on the team hardcoded a path to a file/directory that only exists on their machine. Even if you are working on a shared filesystem, it is a good idea to keep files separate to prevent accidentally overwriting each other’s work. <strong>Explicit paths should never make it into the code.</strong></p>

<p><strong>How to fix it?</strong> Keep all things such as I/O paths and host addresses in a separate place and read from there. For example, you might have a file like this in your project’s root directory:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># locations.yaml</span>
<span class="na">data</span><span class="pi">:</span>
    <span class="c1"># all raw data goes here</span>
    <span class="na">raw</span><span class="pi">:</span> <span class="s">~/project/data/raw/</span>
    <span class="c1"># all processed data goes here</span>
    <span class="na">processed</span><span class="pi">:</span> <span class="s">~/project/data/processed</span>

<span class="c1"># host to the database</span>
<span class="na">db</span><span class="pi">:</span> <span class="s">db.organization.com:5421/database</span>
</code></pre></div></div>

<p>Everyone should then treat that file as a <em>contract</em>: read from and write to only those directories. Each member can customize their configuration file and nothing should break. Your code will look like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">my_project</span> <span class="kn">import</span> <span class="n">locations</span>


<span class="k">def</span> <span class="nf">clean_data</span><span class="p">():</span>
    <span class="c1"># load content of locations.yaml
</span>    <span class="n">path_raw</span> <span class="o">=</span> <span class="n">locations</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="s">'raw'</span><span class="p">]</span>
    <span class="n">path_clean</span> <span class="o">=</span> <span class="n">locations</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="s">'processed'</span><span class="p">]</span>

    <span class="c1"># read a file relative to the raw data folder...
</span>    <span class="n">pd</span><span class="p">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">path_raw</span><span class="p">,</span> <span class="s">'dataset.parquet'</span><span class="p">))</span>

    <span class="c1"># clean the data...
</span>
    <span class="c1"># write to a file relative to the clean data folder...
</span>    <span class="n">pd</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">path_clean</span><span class="p">,</span> <span class="s">'dataset.parquet'</span><span class="p">))</span>
</code></pre></div></div>

<p>Make sure the file is easy to discover from your scripts: you might want to create a function that automatically looks for a <code class="language-plaintext highlighter-rouge">locations.yaml</code> file in the current working directory or any of its parent folders (up to a certain number of levels) and raises an <code class="language-plaintext highlighter-rouge">Exception</code> if it cannot find one.</p>

<h2 id="3-end-to-end-pipeline-execution-requires-manual-intervention">3. End-to-end pipeline execution requires manual intervention</h2>

<p>A pipeline is not truly a pipeline if it needs manual intervention to run. Given the raw data, you should be able to run it end-to-end with a single command. For starters, that means using only scripting tools such as Python or R, and no GUI tools such as Excel.</p>

<p>Automated execution is a prerequisite for automated testing. Bugs are inevitable, but automated testing can save you from finding those bugs in a production environment.</p>

<p><strong>How to fix it?</strong> If setup instructions are provided and there are no hardcoded paths, automating the pipeline will be easier. As with setup instructions, the only reliable way to keep it working is to include a shell script in the CI service that runs the pipeline. If you are working with large datasets, you may want to pass a sample of the data for testing purposes.</p>

<h2 id="4-intermediate-results-are-shared-over-e-mailcloud-storage">4. Intermediate results are shared over e-mail/cloud storage</h2>

<p>An unfortunately common practice in many data analysis projects is to share intermediate results. Reasons vary, but the pattern goes like this: member <code class="language-plaintext highlighter-rouge">A</code> updates some code in the pipeline that <code class="language-plaintext highlighter-rouge">B</code> needs as input, so <code class="language-plaintext highlighter-rouge">A</code> runs the updated code and shares the new results with <code class="language-plaintext highlighter-rouge">B</code>, who then uses the new file as input instead of the old version.</p>

<p>Sharing intermediate results is a terrible practice since it makes reproducibility harder. <strong>Intermediate results should never be shared</strong>: <code class="language-plaintext highlighter-rouge">A</code> should just push the new code and <code class="language-plaintext highlighter-rouge">B</code> should execute it to generate the new input.</p>

<p><strong>How to fix it?</strong> Fixing this pattern is harder; all the previous sections are prerequisites for this one, namely:</p>

<ol>
  <li>There should be a setup script to configure the environment</li>
  <li>Configuration should be centralized in a single file, out of the source code</li>
  <li>There should be a script to execute the pipeline end-to-end</li>
</ol>

<p>If all those requirements are met, there is no need to share intermediate files.</p>

<p>The only situation where sharing intermediate files might be necessary is when a task either a) takes <em>a long time</em> to run or b) has to run in a restricted environment (e.g., a shared cluster). In such cases, take special care to ensure that the code that produced the results is appropriately stored in version control. <strong>Avoid this situation as much as possible.</strong></p>

<p>For most projects, this should not be the case. If you are working with large datasets, you probably already have distributed infrastructure that makes your computationally heavy scripts run in a reasonable amount of time; if they do not, consider splitting them into smaller steps.</p>

<h2 id="5-a-change-in-a-single-step-requires-you-to-execute-the-pipeline-end-to-end">5. A change in a single step requires you to execute the pipeline end-to-end</h2>

<p>During development, steps are constantly revisited (features added, bugs fixed). Every time you make a change, you have to make sure it propagates to the steps downstream. Since steps in a data pipeline often take minutes or even hours to run, an update should only trigger execution of its downstream dependencies to avoid wasteful computation.</p>

<p>If there is no way for your pipeline to know which steps are affected by a given update, you only have two choices: run the entire pipeline again or manually check which steps have to be re-run. Both options are a waste of your time.</p>

<p><strong>How to fix it?</strong> There is no single answer here. I have not found a library that easily fixes this issue (I implemented my own solution, but it is not publicly available yet). If all your processing is done locally, my recommendation is to use <a href="https://en.wikipedia.org/wiki/Make_(software)">Make</a>.</p>

<h2 id="final-comments">Final comments</h2>

<p>I hope this post helps you find areas for improvement in your data projects. Paying attention to these issues will pay off in the long run. A working workflow will not only increase your productivity, helping you get your analysis right faster, but will also help you build more robust data products.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[Developing reproducible data pipelines is hard, but before we even think about reproducibility your project has to meet some minimum standards. This post discusses some recurring bad practices when developing data pipelines and provides some advice to overcome them.]]></summary></entry><entry><title type="html">The case against data versioning</title><link href="https://blancas.io/blog/ds-versioning-data/" rel="alternate" type="text/html" title="The case against data versioning" /><published>2019-06-27T00:00:00+00:00</published><updated>2019-06-27T00:00:00+00:00</updated><id>https://blancas.io/blog/ds-versioning-data</id><content type="html" xml:base="https://blancas.io/blog/ds-versioning-data/"><![CDATA[<p>A recent technique to advocate for reproducibility in data analysis is <em>data versioning</em>, which means that some (or all) intermediate files generated by the pipeline are saved and tagged so we can come back to them at any moment. But I think data versioning is actually <em>harmful</em> for reproducibility.</p>

<p>Reproducibility is defined as the <em>“ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. Reproducibility is a minimum necessary condition for a finding to be believable and informative <a href="https://stm.sciencemag.org/content/8/341/341ps12.full">(Source)</a>.”</em>  The key term here is <em>materials</em>. The only materials in a data pipeline are the raw data and the code, all other artifacts are byproducts which should not be considered.</p>

<p>We can test for reproducibility by answering the following question: given the <em>same raw data and code</em>, do we get the same results? Using intermediate results and claiming reproducibility is cheating, since we are overlooking all the previous computations that produced those interim results.</p>

<p><em>Reproducibility can only be achieved by construction</em>, it is not a feature you add to your pipeline. The bad news is that you cannot do <code class="language-plaintext highlighter-rouge">pip install reproducibility</code>; the only way to achieve it is through better software engineering practices. The good news is that verifying reproducibility is trivial.</p>

<p><strong>Verifying reproducibility</strong></p>

<p>Verifying that a data pipeline is reproducible is as simple as passing in the <em>raw data</em> and comparing the result against the <em>claimed final output</em>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pipeline</span>

<span class="n">result_final</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="s">'/path/to/raw/data.csv'</span><span class="p">)</span>
<span class="n">result_expected</span> <span class="o">=</span> <span class="mi">42</span>

<span class="k">assert</span> <span class="n">result_final</span> <span class="o">==</span> <span class="n">result_expected</span>
</code></pre></div></div>

<p><strong>Saving intermediate results is useful (for other purposes)</strong></p>

<p>Data pipelines are built from steps that run one after the other; when the final output is unexpected, storing intermediate results makes the pipeline more transparent: we can inspect those results and identify which step went wrong.</p>

<p>They are also useful for avoiding redundant computation. Pipelines usually take a long time to run, and doing an end-to-end run after every small change is wasteful. Selectively running the steps affected by the changes should lead to the same result as executing the pipeline end-to-end.</p>

<p><strong>Versioning your final output</strong></p>

<p>There is one piece of your pipeline that you can version: the final output. If you want to automate reproducibility verification, you might want to store the final output and compare it with the pipeline’s output whenever a change is introduced. If results do not match, your pipeline is no longer reproducible.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[A recent technique to advocate for reproducibility in data analysis is data versioning, which means that some (or all) intermediate files generated by the pipeline are saved and tagged so we can come back to them at any moment. But I think data versioning is actually harmful for reproducibility.]]></summary></entry><entry><title type="html">Applying to a master’s in the U.S. (II): Selecting programs</title><link href="https://blancas.io/blog/study-us-ii/" rel="alternate" type="text/html" title="Applying to a master’s in the U.S. (II): Selecting programs" /><published>2019-04-07T00:00:00+00:00</published><updated>2019-04-07T00:00:00+00:00</updated><id>https://blancas.io/blog/study-us-ii</id><content type="html" xml:base="https://blancas.io/blog/study-us-ii/"><![CDATA[<p>In this second part, I’ll cover which factors to take into account when choosing the programs you’ll apply to. Consider around 10-15 programs in your search, and finally apply to about 6. Program quality is the most important factor, but keep in mind that the more prestigious a program is, the more competitive admission will be; the rest of the aspects I’ll mention are in no particular order, and it’s up to you to decide which matters most.</p>

<h2 id="aspectos-a-considerar">Aspects to consider</h2>

<p><em>tl;dr: research the program’s quality across multiple sources, make sure the program holds recruiting events, and keep in mind that costs vary widely and that you’ll need to prove financial solvency to get the visa</em></p>

<h3 id="calidad-del-programa">Program quality</h3>

<p>The most important factor is the quality of the program, but it is also the hardest to evaluate. Relying on rankings is somewhat misleading because some schools are known only in certain areas, so they don’t show up in the general rankings (example: Carnegie Mellon in Computer Science). It is better to go by the per-area rankings, but even then, there may be more than one program in the same department. My recommendation is to use every resource within your reach, and pay special attention to the details of each program. The best way to get more detailed information about specific programs is to talk to a graduate (LinkedIn is a good resource for this).</p>

<p>Pay attention to things such as: required courses, the selection of electives, the possibility of taking courses in other departments or nearby schools, program size (a 30-person program is a very different experience from a 200-person one), and graduate statistics.</p>

<h3 id="salida-laboral">Job prospects</h3>

<p>If you are interested in doing a summer internship or working temporarily after your program (both the F1 and J1 student visas allow it), it is important to research whether the programs hold recruiting fairs; in my experience, this is one of the best ways to get an interview (the other is a referral from someone who works there). Make sure to look for recruiting events exclusive to the program you’re interested in; school-wide events are not as effective, since so many people attend that you don’t have enough time to strike up a conversation with the recruiters.</p>

<h3 id="costo-del-programa">Program cost</h3>

<p>This point is the most tedious one, especially if you are just starting to look at options, but informing yourself ahead of time will spare you many complications later. Annual tuition varies between universities, and living costs vary even <em>much more</em> from city to city (living in NYC is far more expensive than living in Austin). When applying for the visa, you will have to prove financial solvency through a bank account, scholarship award documents, loans, etc. It is very likely that at this stage of the process you still won’t know whether you’ve been granted a scholarship, so if you depend on one to cover your expenses, choosing a more affordable program will reduce the risk of complications if the scholarship doesn’t come through (in the next article I will focus on scholarships, loans, and other options for financing your program).</p>

<p>Few master’s programs offer scholarships, and they usually award them along with your acceptance letter, so if a program you’re interested in offers one, take that into account. Program length matters a lot because it directly impacts cost: most programs take two years, but there are also 3-semester and even one-year programs. Although a longer program will give you more time to deepen your learning, the increase in cost will be considerable.</p>

<h3 id="ubicación-de-la-universidad">University location</h3>

<p>Beyond its economic impact, the university’s location will also shape your experience. First, consider that some universities are located in big cities (New York, Chicago), others in smaller cities (Boston), and others in places where <a href="https://en.wikipedia.org/wiki/College_town">the university is pretty much all there is</a>. Certainly, the quality of your program matters far more than the city’s entertainment options, but at least do the mental exercise of imagining yourself living in this or that city and make sure you would feel comfortable there.</p>

<p>On the other hand, location will also affect your job prospects. Even though most applications happen online and nothing stops you from moving to another city for work, being close to industry will make the process easier. For starters, companies tend to recruit at local universities; moreover, if you start job hunting before graduating, it will be much easier to attend interviews if the companies are in the same city (the final round of interviews is always at the company’s office); otherwise, your availability for interviews will be limited by how long you can be away without hurting your academic performance.</p>

<h2 id="a-cuántos-programas-aplicar">How many programs should you apply to?</h2>

<p><em>tl;dr: apply to 6 programs; pick the two you want most, two not-so-competitive ones, and two “safe” options</em></p>

<p>Once you have a list of about 10-15 programs, it is time to decide which ones to apply to. Keep in mind that each application will cost 75-100 USD, but even more important, you will have to send a <em>different</em> statement of purpose and recommendation letters to each program. How different? That is up to you. Although you can send the same letters, I think that is a very bad strategy since it shows little interest on your part. At a minimum, your letter (and your recommenders’ letters) should include the name of the university and the program; ideally, a portion of your statement of purpose will discuss the details of each program and explain why they should admit you (asking for a different recommendation letter per program is very complicated, so at a minimum ask your recommenders to change the name).</p>

<p>On the other hand, the fewer applications you send, the greater the risk of not getting into any program (yes, that happens, especially in competitive programs where admission rates are often in the single digits). I recommend applying to no fewer than 6 programs; all of them should be programs you are sure you would enroll in if admitted. It is important to select those programs in a way that reduces the risk of being left out entirely; I recommend doing it as follows: 2 of them can be free picks (the two best programs in your area, for example), another 2 can be programs that are not as competitive, and two “safe” options. The hard part is assessing how likely you are to be admitted; for that, it is best to talk to an <a href="https://educationusa.state.gov/centers/educationusa-advising-center-comexus">expert</a> who can evaluate your profile.</p>

<h2 id="comentarios-finales">Final comments</h2>

<p>Choosing the programs you’ll apply to is not easy, so give it enough time. In the next part, I will talk about available scholarships and loans, as well as other options for financing your program. If you have any questions or comments, don’t hesitate to write me on Twitter <a href="http://twitter.com/edublancas/">@edublancas</a> or by email at <a href="mailto:edu@blancas.io">edu@blancas.io</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[In this second part, I’ll cover which factors to take into account when choosing the programs you’ll apply to. Consider around 10-15 programs in your search, and finally apply to about 6. Program quality is the most important factor, but keep in mind that the more prestigious a program is, the more competitive admission will be; the rest of the aspects I’ll mention are in no particular order, and it’s up to you to decide which matters most.]]></summary></entry><entry><title type="html">Applying to a master’s in the U.S. (I): Planning your application</title><link href="https://blancas.io/blog/study-us-i/" rel="alternate" type="text/html" title="Applying to a master’s in the U.S. (I): Planning your application" /><published>2019-03-30T00:00:00+00:00</published><updated>2019-03-30T00:00:00+00:00</updated><id>https://blancas.io/blog/study-us-i</id><content type="html" xml:base="https://blancas.io/blog/study-us-i/"><![CDATA[<p>To inaugurate my blog, I have decided to write a series of articles for those interested in getting into a competitive STEM master’s program in the U.S. This series will contain information I gathered from various sources while going through the process, but also things I had to learn along the way (which would have been very useful to know from the start).</p>

<p>Keep in mind that these articles are based solely on my experience, and it is impossible to give step-by-step guides since every school has different criteria. I got most of the information in this first article from <a href="https://quora.com">Quora</a> and <a href="https://magoosh.com">Magoosh</a>; I recommend looking for more detailed resources on those sites.</p>

<p>In this first part, I will talk about what to consider if you are thinking of applying: whether you are in the first years of your bachelor’s degree, about to graduate, or a few months away from starting the application process.</p>

<p>The first thing to mention is that every aspect of your application matters, and the only way to improve your chances is to have a competitive profile.</p>

<h2 id="si-estás-leyendo-esto-durante-los-primeros-años-de-tu-licenciatura">If you are reading this during the first years of your bachelor’s degree</h2>

<p>TL;DR: Keep your GPA above 9.2 (out of 10) and get involved in academic activities (preferably with an institution in the U.S.).</p>

<h3 id="tu-promedio-importa">Your GPA matters</h3>

<p>If you are pursuing an undergraduate degree and considering a <em>highly competitive</em> master's program in the U.S., your undergraduate GPA is something you should take care of. Even though schools state that they have “no minimum GPA requirement,” a low GPA can rule you out (although a high one does not guarantee admission). In general, consider that a “good” GPA for one of these programs is 3.7/4 (equivalent to 9.2/10). If your GPA is lower than that, it does not mean you have no chance of being admitted, but you will need to make up for it in other aspects of your application (with an excellent GRE score, for example).</p>

<p>Two important considerations: if you come from a school that grades strictly and a member of the committee knows it, that will be a factor they consider. It is hard to know how familiar the admissions committee is with your school, but you can research whether alumni of your school have graduated from the programs you are interested in, or better yet, whether a professor in the program graduated from your university.</p>

<p>Another important detail is that the admissions committee will give more weight to grades in your field than to the rest (a 7 will hurt you more if it was in calculus than if it was in literature).</p>

<h3 id="tus-actividades-fuera-del-salón-también">…so do your activities outside the classroom</h3>

<p>One way to stand out among applicants is to show that you are involved in your field outside the classroom. If you have the chance to get involved in projects at your university or at a company, it can help you a lot (the summer or an exchange semester are good ways to do so). Something that can definitely make a difference is doing these projects at a university in the U.S., better still at a school with prestige in your area of interest.</p>

<p>Some master's programs are research-oriented (you have to write a thesis); this is most common in the sciences (in engineering, programs tend to be more applied). If your program requires a thesis, it is important to focus your academic activities on research (rather than doing applied projects at a company, for example), even better if those projects lead to scientific publications.</p>

<h3 id="estudia-inglés">…study English</h3>

<p>This is an obvious point, but I don't want to leave it out. If you are in the first years of your degree and cannot <em>speak</em> English fluently, it is important to start practicing early, since the English exam you will be asked for (TOEFL iBT) includes a <em>speaking</em> section.</p>

<h3 id="y-si-ya-me-gradué-o-estoy-a-punto-de-graduarme">What if I already graduated or am about to?</h3>

<p>If you have already graduated or are about to, it will be harder to raise your GPA or get involved in academic activities, so getting good scores on the exams (next section) is very important. If you graduated a few years ago, your work experience (especially if it is in the field of the master's program) can also help you.</p>

<h2 id="los-requisitos-para-la-solicitud-de-admisión">The admission application requirements</h2>

<p>TL;DR Get a TOEFL score of at least 100 points and a GRE quantitative score at least in the 90th percentile.</p>

<p>Practically every program you apply to will ask for the same requirements: GRE, TOEFL iBT, <em>résumé</em>, statement of purpose, and letters of recommendation. It is important to plan how to meet these requirements well in advance. Applications are due in December; I recommend making a first attempt at the exams and starting to work on the statement of purpose and the recommendation letters about 6 months earlier. It may seem like a lot of time, but many factors will be out of your control (available exam dates, for example), and I assure you it will take longer than you plan.</p>

<p>Another important aspect is program selection; I also recommend starting your research about 6 months in advance (the next post in this series will focus on that). Programs usually do not have minimum exam scores, but some do (for example, I remember seeing a program with a minimum score in the <em>speaking</em> section of the TOEFL iBT), so it is important to have an idea of the programs you will apply to in case any of them have requirements of this kind.</p>

<p>There is an enormous amount of material on how to prepare your application, so I will be brief and only include the points I consider most important.</p>

<h3 id="consejos-para-los-exámenes">Tips for the exams</h3>

<h4 id="toefl-ibt">TOEFL iBT</h4>

<p>Giving advice about the TOEFL is difficult because it depends heavily on how prepared you are a few months before applying. Ideally you already have a good level and only want to spend a little time improving your score. The best way to get a good diagnosis is to take the exam once. If you achieve a very good score (&gt;=110), forget about the TOEFL and focus on the GRE; if your score is not as good (under 100), consider taking a course to raise it.</p>

<h4 id="gre">GRE</h4>

<p>In technical programs, your GRE score is a basic requirement (like your undergraduate GPA). The <em>verbal reasoning</em> and <em>analytical writing</em> sections are just a requirement: as long as your score is decent (what counts as decent depends on each program), the admissions committee probably won't give them much weight. The score that matters is <em>quantitative reasoning</em>. As a rule of thumb, consider a good score to be at or above the 90th percentile. The best resource I found for studying for this exam and getting statistics on which scores are considered good is <a href="https://magoosh.com/">Magoosh</a>. Likewise, I recommend taking the exam about 6 months in advance to assess your situation and determine whether you need to study and retake it.</p>

<h4 id="cómo-prepararme">How should I prepare?</h4>

<p>The only advice I can give about preparing is to practice both exams at home in a format as close as possible to the real thing: with a timer per section and using only the allowed materials. This is especially important for the GRE, where time per section is critical and you need to get used to solving the questions quickly.</p>

<h3 id="résumé"><em>résumé</em></h3>

<p>Don't overcomplicate the design of your résumé; look for a template so it follows a standard structure. Keep it strictly to one page. Use bullet points, and look for guides on how to write them. It is important that the points be concise and specific.</p>

<h3 id="carta-de-motivos">Statement of purpose</h3>

<p>The statement of purpose is your only chance to convince the committee to admit you; it should be brief and concise (one page is an appropriate length). Use this space to talk about what you have done, what you will do in the program, and what you will do after graduating. Emphasize what makes your profile distinctive; admissions committees highly value diversity in all respects. If there is a weak point in your application (for example, a low undergraduate GPA), use this space to explain any extraordinary circumstance that may have affected it, if that is the case.</p>

<h3 id="cartas-de-recomendación">Letters of recommendation</h3>

<p>Programs usually ask for 3 letters of recommendation. They must be strictly academic/professional. Request them only from people who can speak concretely about your abilities: it is better to have a recommendation letter from a first-year professor you worked with on a project for a whole year than from the dean of your school who only knows your name. The more detailed the letter, the better, so consider people you know will take enough time to write a very positive and detailed letter. If you are in the first years of your degree, this is the right time to start approaching your professors and building academic relationships.</p>

<h2 id="comentarios-finales">Final remarks</h2>

<p>In the next part I will talk about how to choose the programs you will apply to. Until next time! If you have any questions or comments, don't hesitate to write to me on Twitter <a href="http://twitter.com/edublancas/">@edublancas</a> or by email at <a href="mailto:edu@blancas.io">edu@blancas.io</a>.</p>]]></content><author><name>Eduardo Blancas</name><email>edu@blancas.io</email><uri>https://ploomber.io</uri></author><summary type="html"><![CDATA[To inaugurate my blog, I have decided to write a series of articles for those interested in getting into a competitive master's program in STEM in the U.S. This series will contain information I gathered from various sources while I was going through the process, as well as things I had to learn along the way (and that would have been very useful to know from the start).]]></summary></entry></feed>