Home

Building With AI Coding Agents While Keeping Your Data Safe

2025-10-27T00:00:00+00:00

How I built a security-first development environment for AI coding agents—because “move fast” and “YOLO mode” isn’t an option when you’re working in biomedical research.

tl;dr

Throughout this year, I found myself increasingly excited about the potential of AI coding agents. At the same time, I grew more uncomfortable with the security tradeoffs they introduce. If you found yourself in a similar position, this post may resonate with you.

I built safer-codespace over only a few evenings, an experimental development container template that lets me work with AI coding agents while having multiple security layers in place to make data exfiltration considerably harder.

My Motivation: Security That Doesn’t Kill Productivity

What I really needed was a template for new projects that would meet my specific requirements:

Have security and observability baked in from the start - Not as an afterthought, but as foundational layers. This means:

Network firewall configured automatically on container build
Content segregation workflow (trusted vs. untrusted directories)
Complete LLM interaction logging to a SQLite database via the llm CLI tool
Optional integration with SpecStory to automatically save terminal agent conversations as clean, searchable markdown—preserving the “why” behind agent actions and creating a path to iteratively evaluate and improve my prompts and workflows over time

Have the option to use LLMs of my choice - Support for multiple LLM providers including open-source models, not locked into a single vendor. I wanted flexibility and control over which models process my data, with the ability to run models locally if needed.

Provide an easy path to iterate and improve - Both the project itself and my development workflow. I wanted to learn from each development session and continuously refine my approach based on real-world usage.

This isn’t about being paranoid. In biomedical research, I may be working with:

Personally identifiable information from study participants
Sensitive genetic and clinical data
Intellectual property that must be protected
Research data subject to ethics committees and data protection regulations

Any data exfiltration from such research projects could have serious consequences.

Yet, as Dario Amodei has noted in his 2024 essay, Machines of Loving Grace, one of the prospects I’m also most excited about is the positive implications powerful AI systems could have in biology and physical health in the coming years—if we do it right.

Understanding the Risk: The Lethal Trifecta

Before diving into solutions, we need to understand the threat. Simon Willison coined the term “prompt injection” in 2022, deliberately naming it after SQL injection because both vulnerabilities share the same root cause: mixing trusted instructions with untrusted input.

Just as SQL injection exploits happen when you concatenate a user’s SQL query with malicious input, prompt injection exploits happen when AI systems process untrusted external content alongside their user instructions. An attacker can embed malicious instructions in a document, web page, or dependency file that override your instructions to the LLM. What we currently call “AI assistants” or “AI agents” are autoregressive sequence models that can take actions by calling functions and other code artifacts based on their text output, often in a loop. Their very architecture makes them susceptible to such attacks.

Willison later defined what he calls “the lethal trifecta for AI agents”—three capabilities that, when combined in an AI system, create a critical security vulnerability:

Access to Private Data - The AI assistant can read your code, files, environment variables, or secrets
Exposure to Untrusted Content - The AI assistant processes external documentation, dependencies, emails, or web pages
Ability to Exfiltrate Data - The AI assistant can make network requests to send data to external servers

When all three are present, the attack scenario becomes straightforward:

An attacker embeds malicious instructions in documentation → Your AI assistant reads private data → The instructions tell it to find sensitive information → It sends that information to the attacker’s server

This isn’t theoretical. Such exploits have been documented extensively. The fundamental problem is that LLMs cannot reliably distinguish between trusted instructions and malicious content—they’re optimized to model patterns in natural language and other data types, not to discover or verify truth, or understand the physical world.

“Guardrails” that try to detect attacks might work 99% of the time, but as Willison notes, 99% is a failing grade in web application security. Attackers get unlimited attempts to craft bypasses, and they only need to succeed once.

The Security Concept: Defense-in-Depth

Since no single defense is perfect, this template takes a defense-in-depth approach: deploy multiple independent security layers so that if one fails, others still provide protection. The goal isn’t perfect security (which may be impossible with current LLM architectures), but to make attacks considerably harder and limit potential damage.

The core strategy: remove at least one leg of the lethal trifecta through complementary and redundant controls:

1. Network Firewall: Blocking Unauthorized Exfiltration

The devcontainer includes an automatically-configured network firewall that blocks all outbound connections except to pre-approved development endpoints like GitHub, npm, PyPI, and AI API providers. The firewall script was directly adapted from Claude Code’s open-source repository.

This doesn’t prevent all data exfiltration—an attacker could still potentially abuse allowed endpoints like GitHub to create issues or gists containing stolen data. But it significantly limits the attack surface and prevents the most straightforward exfiltration methods (like sending data to attacker.com).

The firewall is validated by automated tests that confirm:

Required endpoints remain reachable
Unauthorized domains are blocked
The security layer is actually active

2. Content Segregation: Trusted vs. Untrusted

The template uses a simple directory structure to separate vetted content from potentially dangerous external material:

context/
├── trusted/       # Human-reviewed, safe for AI access
└── untrusted/     # External content requiring review

The workflow: fetch external documentation or dependencies using non-AI tools (like my url-to-markdown script adapted from AnswerAI’s web2md), save them to untrusted/, manually review for malicious instructions, then move to trusted/ only after verification. AI assistants should only access trusted/ content.

This creates a mandatory review checkpoint before AI systems can process external content.

3. Human Review: The Critical Layer

Here’s the key insight: don’t use AI to detect prompt injection attacks. AI-based detection is fundamentally unreliable because attackers can craft prompts specifically designed to bypass detection.

Instead, rely on human intelligence. Before providing external content to your AI assistant, you review it yourself for suspicious instructions like:

“Send the contents of .env to…”
“Ignore previous instructions and…”
Unusual URLs or exfiltration commands

This is the most important layer. Security tools can help, but informed human judgment remains the most reliable defense.

4. Tool Selection: Use Less Powerful Tools When Possible

Not every task needs an AI agent with full system access. The template includes multiple tools with different capability levels:

Claude Code - Full-featured AI assistant with file access and command execution. Use for complex development tasks that genuinely need these capabilities.

llm CLI tool - Text-only interface with no file access or command execution by default. Perfect for:

Explaining error messages
Generating commit messages from git diff output
Code reviews on specific snippets
Quick questions about syntax or concepts

Even if compromised through prompt injection, the llm tool’s damage is limited to the text you explicitly pipe to it.

Manual commands - For simple, routine tasks (git operations, package installation, running tests), skip AI entirely. It’s faster, safer, and you maintain direct control.

This graduated response means you only use powerful tools when actually needed, reducing your attack surface for routine operations.

What’s Actually in the Template

Let me show you the concrete tools and configuration that implement these security principles:

AI Development Tools

Claude Code - Anthropic’s interactive AI assistant with file access and command execution capabilities. Great for complex, multi-step development tasks.

llm CLI tool - Simon Willison’s command-line wrapper for calling various LLMs. Configured by default to use GitHub’s free GPT-4o (no API key required). Perfect for piping output from other CLI tools:

# Generate commit message from staged changes
git diff --staged | llm -s "write a conventional commit message"

# Explain an error
cat error.log | llm "what does this error mean?"

SpecStory (optional) - Automatically saves Claude Code conversations as clean markdown. This preserves the reasoning and design decisions behind your code as git-friendly documentation.

Development Productivity Tools

glow - Beautiful markdown rendering in the terminal for monitoring documentation and reviewing saved conversations.

just - Simple command runner for project workflows. Makes it easy to define and run common development tasks.

url-to-markdown script - Adapted from AnswerAI’s web2md tool, this helps fetch external documentation safely without using AI, supporting the content segregation workflow.

Development Environments

Python 3.13 with uv for fast package management
Node.js 24.x with npm
Go (latest stable version)

Infrastructure

Pre-configured devcontainer with security baked in from the start
Four automated GitHub Actions workflows that validate:
- Devcontainer builds successfully
- Network connectivity works
- Firewall actively blocks unauthorized domains
- Tools are properly installed
Comprehensive documentation and usage examples

Agent instructions

The repository includes a CLAUDE.md file with detailed development guidelines, security principles, and best practices, heavily inspired by Eric Ries’s The Lean Startup methodology to build Minimum Viable Products (MVPs) first, and heavily focused on test-driven development. Agent instructions are of course subjective and also depend on personal preferences. Feel free to adapt them to your own style.

Here’s a meta-insight that might seem paradoxical: I built this security-focused template largely using Claude Code itself—the very AI assistant I was trying to secure.

The GitHub Actions workflows that validate the security controls? Built with Claude Code. The firewall configuration? Adapted from Anthropics reference container setup for Claude Code. The testing scripts? Built with Claude Code. Honestly, I’d never have had the patience and time to write all of this manually from scratch: navigating the YAML syntax, writing in programming languages less familiar to me, and creating test cases. With the amazing open-source resources from trusted sources and Claude Code guiding me through the process of test-driven development, it was actually enjoyable. Huge thanks to all the open-source contributors whose work made this possible! All this was built over only a few evenings, heavily inspired by a new Maven Course taught by Eleanor Berger and Isaac Flath.

Weaknesses: Let’s Be Honest About Limitations

This template isn’t a silver bullet, and I want to be completely transparent about what it doesn’t do:

It’s not a complete solution. Prompt injection remains an unsolved problem in the AI research community. Determined attackers with knowledge of the allowed endpoints could still potentially exfiltrate data through creative attacks on approved services.

It requires user awareness. The security model only works if you understand the risks and follow the content review workflow. There’s no foolproof automation that can protect you from yourself. This is about informed risk management, not eliminating all risk.

It’s experimental and evolving. This template is for education and exploration, not production deployment without careful evaluation for your specific context. Use it to learn about the threats and practice safer AI workflows.

Legitimate questions remain:

How do we better detect malicious content in dependencies?
Can we create safer abstractions for AI tool capabilities?
What additional security layers might help?
How do we balance security with the rapid pace of AI tool development?

The honest truth is that we’re all still figuring this out. The AI security landscape is evolving rapidly, and what works today may need adjustment tomorrow. This template is one approach based on current understanding—not a final answer.

Conclusion: An Invitation to Explore

The safer-codespace template isn’t claiming to have “solved” prompt injection—no one has. It’s a practical implementation of defense-in-depth strategies that you can use today while acknowledging the limitations. If you work with AI coding assistants and care about security—whether you’re in healthcare, finance, research, or any field with sensitive data—I invite you to:

Try the template - See if these patterns fit your workflow and requirements
Critique the approach - Where are the gaps? What’s missing? What could be better?
Contribute improvements - Better protections, clearer documentation, more educational content

Get started: github.com/nicomarr/safer-codespace

Reference and Additional Resources

Simon Willison’s weblog content on prompt injection

The lethal trifecta for AI agents: private data, untrusted content, and external communication (2025) - Original framework defining the three dangerous capabilities
Prompt injection explained, with video, slides, and a transcript (2023) - Presentation with annotated slides
Prompt injection attacks against GPT-3 (2022) - The post that coined the term “prompt injection”
Series: Prompt injection - Comprehensive collection of research, real-world attacks, and ongoing developments

AI Development Tools

Claude Code overview - Features and capabilities
llm CLI tool - Simon Willison’s command-line tool for working with LLMs

Enhancement and Productivity Tools

SpecStory - Automatically save Claude Code conversations as markdown
SpecStory documentation - Setup and usage guide
glow - Beautiful markdown rendering for the terminal
just - Simple command runner for project workflows

Open Source Scripts Adapted for This Template

Claude Code firewall script - Bash script restricting network access to Docker DNS and approved IPs
web2md by AnswerAI - Python script that takes a URL, makes a request, and converts web content to markdown

Important: This template implements defense-in-depth strategies for AI coding assistants but does not claim to have solved prompt injection or to fully prevent attacks. It creates multiple security layers to make data exfiltration considerably harder while acknowledging the limitations of any security controls. Always review external content before exposing it to AI assistants. Use at your own risk.

Set up a Local LLM-powered Research Assistant for Health & Life Sciences

2024-10-18T00:00:00+00:00

Welcome to the second in a series of short tutorials aimed at making Large Language Model (LLM)-powered applications more accessible for health and life sciences researchers. This tutorial introduces Scholaris, a Python package that allows anyone to set up a research assistant on a local computer and leverage function calling capabilities “out of the box”. Scholaris is designed specifically for use in health and life sciences to help gain insights from scholarly articles and interact with academic databases.

In this tutorial, you’ll learn:

How to use Scholaris to set up an assistant and leverage its tools for various research tasks.
How tool or function calling works using an LLM.
How to customize the assistant for your specific research needs.

Is this for you? This tutorial and the Scholaris Python package is suitable for anyone and does not require any prior knowledge of Python programming. The only prerequisite is that you are willing to learn and explore new tools and technologies. If you find particular terminology or concepts confusing, feel free to use a cloud-hosted large language model and ask it to explain them in simpler terms. Throughout this tutorial, you will also find text boxes with additional explanations. Feel free to skip these if you are already familiar with the concepts. If you are an experienced software developer, you may still find the tutorial useful for specific applications in health and life sciences. Be sure to check out the last section on customizing the assistant to suit specific research needs.

Getting started

Note! This section assumes that you have Python and the Ollama app installed on your local computer. If you want to follow along with the code examples, see the documentation for installation instructions.

To get started, import and initialize the Assistant class from Scholaris:

from scholaris.core import *
assistant = Assistant() 

This creates an instance of the Assistant class with default settings. At the time of writing, the default model is Llama 3.1 8B. You can also customize various parameters if needed, such as specifying a different model or adding custom tools (more on this in a later section). A detailed description of how to do so can also be found in the documentation pages. For now, let’s take advantage of the core functionality and default options.

The primary way to interact with the Scholaris assistant is through the chat method. Here’s an example:

response = assistant.chat("Briefly tell me about the tools you have available.")

This will prompt the assistant to describe its available tools and capabilities.

Terminology explained:

A class in Python is like a blueprint or a template for creating objects. It's a way to bundle data and functionality together. Objects are instances of classes, and they can have attributes (variables) and methods (functions).

The terms function calling and tool calling are often used interchangeably. Strictly speaking, functions typically receive one or more parameters as input to generate (return) one or more outputs, whereas a tool is a more broadly defined term in the context of large language model (LLM)-driven applications. Tools may refer to a wider range of operations, including functions, code blocks that are executed without additional parameters, multiple functions executed in sequence or parallel, or other types of actions. In the context of LLM-driven applications, both function calling and tool calling refer to the model's ability to generate structured outputs. These outputs are based on predefined schemas, not on the actual code that is executed. If an LLM-powered application or workflow has the capability to call functions or tools, it may also be referred to as being "agentic".

JSON, short for JavaScript Object Notation, is a lightweight, text-based data format that is easy for humans to read and write, and simple for LLMs to parse and generate. JSON uses curly braces for objects, colons to separate keys and values, and commas to separate key-value pairs or elements of list objects. Originally developed for JavaScript, JSON has become a widely-used standard for data exchange in web applications and beyond.

API, short for Application Programming Interface, is a set of rules and protocols that allow different software applications to communicate with each other. It is the primary way of how an application is interacting with an external service or tool, such as a database, web service, or library (analogous to how humans interact through a graphical user interface, e.g., a web browser).

Function / tool calling

Before we dive into a few use cases, let’s first review what tool or function calling is and how it works.

In a nutshell, the process of tool / function calling involves multiple steps, as illustrated below:

Figure 1. Flowchart of a basic LLM-powered application with function calling, illustrating the process from user input to final response generation.

Tool / function calling: The LLM receives a system message with core instructions (often including an assigned role), the user prompt, and a description of available functions with their parameters. If the tool call is made in the middle of a conversation, the LLM also receives as input the conversation history of the session (i.e., since the initialization of the assistant or since the last reset). Otherwise, LLMs are stateless (i.e., they do not have persistent memory of previous interactions). When using Scholaris, the conversation history is automatically stored in an attribute of the Assistant class, called conversation_history. Be sure to check out the documentation on how to access and use the conversation history. Importantly, the LLM does not “see” the source code for execution. Instead, it receives a description of the purpose and usage of a code element, like a Python function, usually provided as text formatted using JSON. The content of this JSON-formatted string is equivalent to the “docstring”. It is the programmer’s responsibility to ensure proper functionality. Based on the user prompt, the LLM returns the most suitable function name and its parameters for the Python interpreter to execute.
Execution of the selected tool / function: The Python code for the selected function or tool (and any nested functions) is executed, using optional or required parameters provided by the LLM. These functions may be designed to retrieve data from external databases, extract information from local files, or perform tasks like listing the contents of a specific directory. Scholaris includes several built-in tools that can be called by the LLM, such as:
- get_file_names
- extract_text_from_pdf
- get_titles_and_first_authors
- summarize_local_document
- describe_python_code
- id_converter_tool
- query_openalex_api
- query_semantic_scholar_api
- respond_to_generic_queries
Response generation: Finally, the LLM is “called” again to generate a response based on the output of the executed function, the user prompt, and the conversation history as context (which also includes the system message). A programmer can extend this step by implementing additional routines and logic, such as:
- Self-reflection - a mechanism that can be implemented, allowing the LLM to evaluate the response it generated for accuracy and completeness, and repeat the response generation, if necessary.
- Iterative loops - can be implemented, which call additional functions or prompt the user for more details, creating an iterative process by which the response is refined.
- Multi-step problem solving - for complex queries. A workflow can be designed that breaks down tasks into multiple steps. Different functions might be called in sequence or in parallel to gather all necessary information before a comprehensive response is formulated.
- Integration of multiple function outputs - can be combined, allowing information from different sources to be synthesized to provide a more holistic answer.

These additional steps are not part of the core functionality of Scholaris and would need to be implemented by the user.

Use cases

Let’s explore a few practical use cases:

1. Extracting information and summarizing content from local files

By default, the assistant has access to a single directory, called data. Within this directory, the assistant can list and read the following file formats and extensions: .pdf, .txt, .md or .markdown, .csv, and .py. If not already present, the data directory is created in the parent directory when the assistant is initialized.

Extracting information from local files is particularly useful for content that contains sensitive information (e.g., your local files might contain identifying information of study subjects) or for content that is outside your main area of expertise (e.g., a Python script for data analysis obtained from a colleague or collaborator, or a medical chart with diagnostic codes). Additionally, it is useful for documents that are very technical in nature or otherwise difficult to read.

Let’s use the source code of Scholaris as an example! To extract and summarize the content of the source code file, you must first copy it to your local data directory. You can do this using your file manager (e.g., Finder or Explorer), in the terminal, or by using the Python shutil module, like so:

import shutil
shutil.copy("path/to/scholaris/core.py", "data/core.py") # Make sure to replace "path/to/scholaris/core.py" with the actual path to the file.

Now you can ask the assistant to summarize the content of the file:

response = assistant.chat("Summarize the content of the file `core.py` in the local `data` directory.")

You may also ask the assistant to list the contents of the data directory:

response = assistant.chat("List the contents of the local `data` directory.")

There are numerous ways to customize the assistant to suit your needs. In the next section, we will explore these possibilities in more detail. For now, let’s illustrate another use case.

2. Retrieving citation metrics from an external source, such as the OpenAlex API

The assistant can query the OpenAlex API to retrieve citation metrics for a given Digital Object Identifier (DOI). This is particularly helpful if you want to include citation metrics in your literature search or when you need to quickly assess the impact of a specific article. Here’s an example:

response = assistant.chat("How often has the article with the DOI `10.1172/jci.insight.144499` been cited?")

This will prompt the assistant to query the OpenAlex API and return the citation metrics for the specified DOI.

Customizing the assistant for your needs

Tip: To customize the assistant, it is helpful to have a basic understanding of Python programming. If you are new to Python, consider taking a beginner course, such as AI Python for Beginners, a free short course offered by DeepLearning.AI (approximate time to complete: 4-5 hours). Basic knowledge of how to define a function in Python is all you need to customize the assistant with new tools.

Scholaris is designed to be highly customizable, allowing you to extend its capabilities to suit your specific research needs. The core tools or functions are passed to the assistant like “building blocks” during initialization. Therefore, there is no need to modify the source code and Assistant class in order to expand the tools. There are several ways to customize the assistant:

1. Limiting or replacing the core functions

If you want to change the core functions, you can do so by passing the desired core functions as an argument (in the form of dictionaries) to the Assistant class when it is initialized. For example, to limit the assistant’s ability to respond to generic questions and access external data from the OpenAlex and Semantic Scholar APIs, you would initialize the assistant as follows:

assistant = Assistant(tools = {
    "query_openalex_api": query_openalex_api,
    "query_semantic_scholar_api": query_semantic_scholar_api,
    "respond_to_generic_queries": respond_to_generic_queries,
	"describe_tools": describe_tools
    })

When the assistant is initialized in this way, it will no longer be able to access information from the local data directory or extract information from local files, even though the data directory is still present.

Similarly, you can initialize the assistant to only be able to extract information from local files and summarize the content of local documents:

assistant = Assistant(tools = {
	"get_file_names": get_file_names,
	"extract_text_from_pdf": extract_text_from_pdf,
	"summarize_local_document": summarize_local_document,
	"describe_python_code": describe_python_code,
	"respond_to_generic_queries": respond_to_generic_queries,
	"describe_tools": describe_tools
	})

When the assistant is initialized in this way, it will no longer be able to make API calls to external sources.

It is recommended to keep the describe_tools function and respond_to_generic_queries function in the core tools to maintain the assistant’s ability to describe its tools (including newly added tools) and respond to generic queries, respectively. The latter tool also represents a fallback mechanism in case the assistant is unable to identify the user’s intent or the user’s query is outside the scope of the core tools. When using Scholaris, the research assistant is designed to use a tool to generate a final response to a user’s prompt. This is to ensure that the assistant is primarily providing information which is relevant for health and life sciences. Otherwise it will abort the conversation. Be sure to check out the tools section in the documentation for more details (see callout box: What happens if the assistant is initialized without any tools?)

2. Adding new tools

You can also add new tools to the assistant to extend its capabilities. In this case, the core tools will be appended, not replaced. This is achieved simply by defining a new function in Python. Be sure to use type hints, Google-style docstrings, and the @json_schema_decorator function of the Scholaris Python package to automatically generate the schema for your new function. Your new function can then be passed to the Assistant class during initialization, like so:

assistant = Assistant(add_tools = {"your_new_function": your_new_function})

More details can be found in the Developer Guide section of the documentation.

Let’s revisit a few key points and ideas to consider when using the Scholaris package and customizing the assistant:

Many other libraries and Software Development Kits (SDKs) require you to write JSON schemas of the functions or tools to be called by the LLM. This is simplified in Scholaris by using the @json_schema_decorator and Google-syle docstrings, which are easy to write and read.
Scholaris is designed so that functions or tools are passed as building blocks during initialization of the assistant. Therefore, there is no need to modify the source code of the Assistant class in order to expand its capabilities, unless you want to implement more complex logic and agentic workflows, including multi-step reasoning and/or loops (more on this below).
Scholaris is developed to serve as a framework for building LLM-powered research assistants in health and life sciences rather than a robust production-ready tool. Therefore, you may also modify the existing tools and functions or add similar tools to extend the assistant’s capabilities. Consider using the assistance of an LLM to modify the provided core functions to suit your specific research needs. To do so, you may use larger cloud-hosted LLMs to aid you in this process, although for simple modifications, smaller (local) models may suffice. For example, you can modify the summarize_local_document function to extract specific information from a document that is relevant to your research by modifying the prompt used inside the function.
Always use LLMs responsibly and be aware of their limitations. Use additional models, such as Llama Guard, in production environments to ensure that the assistant does not generate harmful or inappropriate content.

3. Implementing more complex logic and agentic workflows (for advanced users)

If you want to implement more complex logic and agentic workflows, such as multi-step reasoning, iterative loops, or self-reflection, you will need to modify the source code of the Assistant class. The Scholaris package has been written using a ‘literate’ programming style and nbdev, which means that the source code is written in a way that is easy to read and understand. This makes it easier for you to modify the source code to suit your specific needs. Be sure to also view the Jupyter notebook with the ‘literate’ source code and additional tests here.

Wrapping up

In this tutorial, you learned how to set up a LLM-powered research assistant for health and life sciences. Be sure to check out the documentation pages for more details on how to use the Scholaris Python package. Consider it as an application to help you accelerate your research aimed at creating a positive impact, and keep the limitations of LLMs in mind:

Current AI systems lack several essential characteristics of human-level intelligence, including the ability to learn, navigate, and understand the physical world, persistent memory, the ability to plan complex action sequences, and the ability to be controllable and safe by design (not by fine-tuning). cf. Yann LeCun - Keynote at the Hudson Forum

If you spotted any errors or inconsistencies in this tutorial, please feel free to open an issue on the GitHub repository’s issue page.

Streamlining Full-Text Article Retrieval for Research

2024-08-14T00:00:00+00:00

Welcome to the first in a series of short tutorials aimed at making LLM-powered applications more accessible for health and life sciences researchers. This tutorial introduces Python utility functions for interacting with the OpenAlex API, a comprehensive, open-access catalog of global research named after the ancient Library of Alexandria and made by the nonprofit OurResearch.

While other excellent community libraries exist for querying the OpenAlex API, my focus here is on functions tailored for streamlining the retrieval of full-text articles indexed in PubMed and leveraging OpenAlex’s extensive citation data. These features are particularly valuable for Retrieval-Augmented Generation (RAG) applications, which can enhance language model performance and improve response quality.

Check out the accompanying Jupyter notebook to run all examples described below. Note that we use magic commands (%) or (!) to run shell commands within Jupyter notebooks.

Installation and setup

(1) Install the required third-party libraries:

requests
tqdm
selenium
webdriver-manager
nbformat
plotly

In a Jupyter notebook environment, simply install these libraries using line magics and pip, like so:

%pip install -qU requests
%pip install -qU tqdm
%pip install -qU selenium
%pip install -qU webdriver-manager
%pip install -qU nbformat
%pip install -qU plotly

(2) Download the file named openalex_api_utils.py from the following GitHub repo, and save it to your working directory (e.g., the same directory from which you run the accompanying notebook). The openalex_api_utils.py file contains all utility functions described below and in the accompanying notebook.

In Colab or any other Jupyter notebook environment:

!wget -q https://raw.githubusercontent.com/nicomarr/public-tutorials/main/openalex_api_utils.py

Or in a terminal emulator:

wget https://raw.githubusercontent.com/nicomarr/public-tutorials/main/openalex_api_utils.py

If wget is not installed, you may also use curl:

curl -O https://raw.githubusercontent.com/nicomarr/public-tutorials/main/openalex_api_utils.py

(3) Import the utility functions:

from openalex_api_utils import *

(4) Add your email address to the environment variables. In Google Colab, open the side panel, click on the ‘key’ icon and add a key-value pair with the key ‘EMAIL’ (all UPPERCASE, no dash) and your email address as the value, then enable notebook access. See the section below and this link for more details. Your email address is sent as part of the request to the OpenAlex API. This is a common and polite practice that helps speed up response times when making many API calls. It also helps developers contact you if there are any issues. For more details, follow this link.

(5) Load your email address from the environment variables. In Google Colab, you can do that by running the following commands after you have added your email address in the Secrets tab, as described above.

from google.colab import userdata
EMAIL = userdata.get("EMAIL")

If you work with in a Jypyter notebook environment on your local computer, you can import environment variables using the os module:

import os
EMAIL = os.environ["EMAIL"]

Alternatively, you may also just define it directly, like so:

EMAIL = "REPLACE_WITH_YOUR_EMAIL@example.com"

However, it is best practice to keep sensitive information like email addresses out of the code.

Basic usage

First, create a list containing unique identifyers of the works to retrieve information about. In OpenAlex, works can be PubMed articles, books, datasets, and theses. In this tutorial, we first get information about 3 PubMed articles using a unique identifyer for each article. Unique identifyers can be an OpenAlex ID, a PubMed ID (PMID), or a Digital Object Identifier (DOI).

uids = ['https://openalex.org/W4387665659', '33497357', '10.1126/sciimmunol.aau8714']

The first utility function we will be using is get_works(). Here, we pass in as argument the list with the unique identifiers, and the email address we have imported from the environment variables. Note that DOIs are accepted with or without a https://doi.org/ prefix, and OpenAlex IDs are accepted with or without a https://openalex.org/ prefix. We can also set a third (optional) argument, show_progress=True, to show a progress bar. The function returns two list objects. The first list object (which we will name ‘works’) contains all the information retrieved from the OpenAlex API. The second list object contains only messages in case anything goes wrong, for example, if one or more IDs provided do not exist in the database. If everything is fine, this list will be empty.

works, failed_calls = get_works(ids=uids, email=EMAIL, show_progress=True)
print("All works were successfully retrieved.") if len(failed_calls) == 0 else print("Some of the works could not be retrieved.")

Output:

Retrieving works: 100%|██████████| 3/3 [00:01<00:00,  1.92it/s]
All works were successfully retrieved.

We can access the retrieved metadata simply by indexing into the works object, which is a list of dictionaries. The data obtained from the OpenAlex API is stored under the key ‘metadata’. The next line prints the title of the first work in the list (for those new to Python, indexing starts at 0).

print(works[0]['metadata']['title'])

Output:

Harnessing large language models (LLMs) for candidate gene prioritization and selection

To list all the works succesfully retrieved, we can use the list_works()function. This will display selected metadata of the retrieved articles in html format, including first author, title, journal, publication year, how many times it has been cited, the number of references, and 10 related works (which can also be retrieved from the OpenAlex API). Note that the symbols in the output indicate whether the article is open access or not, and whether the full text is available or not.

list_works(works)

Output (html):

Toufiq et al. Harnessing large language models (LLMs) for candidate gene prioritization and selection. Journal of Translational Medicine 2023
Cited by: 10 | References: 64 | Related works: 10
Download PDF   Read Full Text   🔓   📖
 
Khan et al. Distinct antibody repertoires against endemic human coronaviruses in children and adults. JCI Insight 2021
Cited by: 53 | References: 70 | Related works: 10
Download PDF   Read Full Text   🔓   📖
 
Boisson‐Dupuis et al. Tuberculosis and impaired IL-23–dependent IFN-γ immunity in humans homozygous for a common TYK2 missense variant. Science Immunology 2018
Cited by: 152 | References: 99 | Related works: 10
Download PDF   Read Full Text   🔓   📖

Download PDF files

Before we proceed with downloading PDF files, please read the following copyright notice:

Copyright Notice: Downloading PDFs may be subject to copyright restrictions. Users are responsible for ensuring they have the right to access and download the content. Always respect the terms of use of the content providers and adhere to applicable copyright laws. See the following README.md file for further details.

We can pass an additional argument to the get_works() function to save the PDF files a specified directory, like so:

works, failed_calls = get_works(ids=uids, email=EMAIL, pdf_output_dir="./pdfs", show_progress=True)
print(f"Requests: {len(uids)}\nRetrieved: {len(works)}\nPDF files downloaded: {len([work for work in works if work['pdf_path'] is not None])}\nFailed calls: {len(failed_calls)}")

Output:

Retrieving works: 100%|██████████| 3/3 [00:09<00:00,  3.25s/it]
Requests: 3
Retrieved: 3
PDF files downloaded: 2
Failed calls: 0

The PDF files can then be used for parsing the full text, tables, and figures of the articles for retrieval augmented generation. All this will be explained in upcoming tutorials. For now, let’s pay attention to the PDFs and notice that only two PDF files were successfully downloaded, even though all three articles are open access. This is because some publishers have put requirements in place that force us to use a web browser to download the PDFs. We will automate this in a later step. Let’s first just inspect the output. Each element in the returned list object has the follwoing dictionary keys:

print(works[0].keys())

Output:

dict_keys(['uid', 'entry_types', 'metadata', 'pdf_path', 'status_messages', 'persist_datetime'])

We can get status messages for each work using the ‘status_messages’ key.

for work in works:
    print(f"Title: {work['metadata']['title'][:80]}...\nStatus messages: {work['status_messages']}\n")

Output:

Title: Harnessing large language models (LLMs) for candidate gene prioritization and se...
Status messages: 2024-08-14: Successfully retrieved metadata with UID W4387665659. 2024-08-14: PDF saved to ./pdfs/37845713_10.1186#s12967-023-04576-8_W4387665659.pdf. 

Title: Distinct antibody repertoires against endemic human coronaviruses in children an...
Status messages: 2024-08-14: Successfully retrieved metadata with UID 33497357. 2024-08-14: PDF saved to ./pdfs/33497357_10.1172#jci.insight.144499_W3125794218.pdf. 

Title: Tuberculosis and impaired IL-23–dependent IFN-γ immunity in humans homozygous fo...
Status messages: 2024-08-14: Successfully retrieved metadata with UID 10.1126/sciimmunol.aau8714. 2024-08-14: Failed to download PDF from https://immunology.sciencemag.org/content/immunology/3/30/eaau8714.full.pdf. Status code: 403. Selenium disabled. 

We can also get the paths to the PDFs that were downloaded by using the pdf_path key. Note that each PDF file is saved using the following naming convention:

{PMID}_{DOI}_{OpenAlex ID}.pdf

with / replaced by #. The value is None if the PDF file was not downloaded.

for work in works:
    print(f"File path: {work['pdf_path']}\n")

Output:

File path: ./pdfs/37845713_10.1186#s12967-023-04576-8_W4387665659.pdf

File path: ./pdfs/33497357_10.1172#jci.insight.144499_W3125794218.pdf

File path: None

Thanks to the Selenium Browser Automation Project, we can automate web browsers. This additional functionality requires the function to be run in a environment with the Google Chrome Browser installed (e.g, in a virtual machine or on your local computer). Therefore, it will not work in the Google Colab environment.

First, let’s remove any downloaded files from the previous run to give us a clean slate.

!rm -rf ./pdfs
print("Removed pdfs directory and all its contents.")

Output:

Removed pdfs directory and all its contents.

Now, let’s rerun the get_works() function with an additional (optional) argument, namely enable_selenium set to True. This will enable the Selenium browser automation tool to be used in the background to retrieve the full text PDFs of the works that cannot be retrieved using the requests library.

works, failed_calls = get_works(uids, email=EMAIL, pdf_output_dir="./pdfs", enable_selenium=True, show_progress=True)
print(f"Requested: {len(uids)}\nRetrieved: {len(works)}\nWith PDFs: {len([work for work in works if work['pdf_path'] is not None])}")
print(f"Failed calls: {len(failed_calls)}")

Output:

Retrieving works: 100%|██████████| 3/3 [00:12<00:00,  4.23s/it]
Requests: 3
Retrieved: 3
PDF files downloaded: 3
Failed calls: 0

Some publishers require PDFs to be accessed via a browser with a visible user interface. When enable_selenium set to True, the default option is to invoke the browser to run as a background process (i.e., in headless mode). By passing in an additional (optional) argument, is_headless=False, we can fully automate a web browser. This will cause a web browser window to automatically open and close for each article that cannot be downloaded using the requests library.

Persist & load metadata

In addition to downloading PDF files, we can pass in an optional argument to the get_works()function to save the metadata to a specified directory. In doing so, the metadata for each article will be saved as a separate JSON file, using a similar naming convention as for the PDF files. The metadata can then be used later then querying an index during retrieval augmentated [text] generation. This will be the focus of a future tutorial. For now, let’s run the following code to demonstrate this additional functionality:

works, failed_calls = get_works(uids, email=EMAIL, pdf_output_dir="./pdfs", persist_dir="./cache", show_progress=True)
%ls ./cache

Output:

Retrieving works: 100%|██████████| 3/3 [00:09<00:00,  3.10s/it]
30578352_10.1126#sciimmunol.aau8714_W2906653622.json
33497357_10.1172#jci.insight.144499_W3125794218.json
37845713_10.1186#s12967-023-04576-8_W4387665659.json

Works can be loaded from storage using the load_works_from_storage() function, simply by providing the path to the directory where the JSON files of the works are stored. This function returns a list of works, similar to the get_works() function. When sorted by the uid (or alternatively, by using persist_datetime as key), we can assert that both list objects are equal.

works_from_storage = load_works_from_storage(persist_dir="./cache")
works.sort(key=lambda x: x['uid'])
works_from_storage.sort(key=lambda x: x['uid'])
assert works == works_from_storage

Note that the get_works() function also uses the load_works_from_storage() function to check the cache first before making a request to the API; that is, if the storage location is specified using the persist_dir argument. If a work is found in the cache, it is returned directly. This speeds up the process and reduces the number of API calls made. We can illustrate this by running the get_works function again with the same uids. Before the first call, we will clear the cache directory to ensure that the works are retrieved from the API. Note the ~200x speedup when executing the function a second time.

%rm -rf ./cache
_works, _ = get_works(uids, email=EMAIL, persist_dir="./cache", show_progress=True)
_works, _ = get_works(uids, email=EMAIL, persist_dir="./cache", show_progress=True)
%ls ./cache

Output:

Retrieving works: 100%|██████████| 3/3 [00:01<00:00,  2.41it/s]
Retrieving works: 100%|██████████| 3/3 [00:00<00:00, 536.10it/s]

To get further help on the get_works() function and to see all arguments available, execute help(get_works).

Get citations

Next, we want to get all articles that have cited any of the 3 articles for which we obtained the PDFs and metadata in the first place. We can do so by using the get_citations() function, which accepts largely the same arguments as the get_works() function, with two key differences:

We pass in the works object (output of the get_works() function) directly.
The process for the API call is slightly different (hence we use a separate function). This this is not important here.

The output is largely the same as for the get_works() function, with the difference that the value for entry_types is automatically set to “citing primary entry”. This will allow us to differentiate between the primary articles and the articles that cite them. Moreover, the function returns single list object, not a tuple. The basic usage is as follows:

citations = get_citations(works, email=EMAIL, show_progress=True)

Output:

Retrieving citations: 100%|██████████| 3/3 [00:03<00:00, 1.32s/it]
Processing citations: 100%|██████████| 216/216 [00:00<00:00, 1332308.33it/s]

Reminder: When using the get_citations() function to download PDFs, please be aware of potential copyright restrictions. Ensure you have the right to access and download the content, and always respect the terms of use of the content providers. Refer to the Copyright Notice in the following README.md file for more details.

To download PDFs and store the metadata in a cache directory, we can pass in the pdf_output_dir and persist_dir arguments, like so:

citations = get_citations(works, email=EMAIL, pdf_output_dir="./pdfs", persist_dir="./cache", show_progress=True)
print(f"Citations retrieved: {len(citations)}\nPDF files downloaded: {len([work for work in citations if work['pdf_path'] is not None])}")

Output:

Retrieving citations: 100%|██████████| 3/3 [00:02<00:00,  1.19it/s]
Processing citations: 100%|██████████| 222/222 [00:00<00:00, 152570.13it/s]
Retrieving PDFs: 100%|██████████| 222/222 [06:00<00:00,  1.62s/it]
Persisting data: 100%|██████████| 222/222 [00:00<00:00, 624.10it/s]
Citations retrieved: 222
PDF files downloaded: 102

We can also enable the Selenium WebDriver and automate Chrome in headless or standard mode. This is done the same way as for the get_works() function.

citations = get_citations(works, email=EMAIL, pdf_output_dir="./pdfs", persist_dir="./cache", enable_selenium=True, is_headless=False, show_progress=True)
print(f"Citations retrieved: {len(citations)}\nPDF files downloaded: {len([work for work in citations if work['pdf_path'] is not None])}")

Output:

Retrieving citations: 100%|██████████| 3/3 [00:02<00:00,  1.15it/s]
Processing citations: 100%|██████████| 222/222 [00:00<00:00, 545800.40it/s]
Retrieving PDFs: 100%|██████████| 222/222 [09:37<00:00,  2.60s/it]
Persisting data: 100%|██████████| 222/222 [00:00<00:00, 616.35it/s]
Citations retrieved: 222
PDF files downloaded: 151

The remaining citations for which the PDFs could not be downloaded have to be retrieved manually. Most of them are not open access. We will get back to this later; for now, let’s get the references and related works as a next step.

Now, let’s retrieve all references and related works for the three articles we obtained earlier. We’ll use list comprehensions to gather this information efficiently. First, we’ll collect the references for each article. References are the works cited by our original articles. We’ll then flatten the resulting list of lists into a single list of reference IDs. Next, we’ll gather the related works. Related works are identified through an algorithmic process that selects recent papers sharing the most conceptual similarities with a given paper. This selection may include preprints from bioRxiv, which might not yet be indexed in PubMed.

references_ids = [work['metadata']['referenced_works'] for work in works] # List comprehension
references_ids = [item for sublist in references_ids for item in sublist] # Flatten the lists
related_works_ids = [work['metadata']['related_works'] for work in works] # List comprehension
related_works_ids = [item for sublist in related_works_ids for item in sublist] # Flatten the lists

Now we can use the get_works() function in the way that allowed us to retrieve the metadata and PDF files of the 3 articles in the first place. Note the additional (optional) arguments that we pass to the get_works() function, as before. Specifically, we pass values for the persist_dir and pdf_output_dir arguments, which will determine if and where we save the metadata for each article and PDF files to disk. This will save us time in the future if we want to access the metadata for the works again.

We also specify a field called entry_type, which indicates the type of entry we are retrieving. This field will be usefull later when we want to get information about how we retrieved the metadata for each work in the first place. This time, it is not necessary to store the output of the failed calls. Since we will pass in output from the get_works() function, all IDs used as input here must be valid IDs.

For now, we will retrieve the references and related works, and download PDFs with the Selenium WebDriver disabled. This can be done in Colab.

references, _ = get_works(references_ids, email=EMAIL, pdf_output_dir="./pdfs", entry_type="reference of primary entry", show_progress=True)
related_works, _ = get_works(related_works_ids, email=EMAIL, pdf_output_dir="./pdfs", entry_type="related to primary entry", show_progress=True)
print(f"References retrieved: {len(references)}\nPDF files downloaded: {len([work for work in references if work['pdf_path'] is not None])}")
print(f"Related works retrieved: {len(related_works)}\nPDF files downloaded: {len([work for work in related_works if work['pdf_path'] is not None])}")

Output:

Retrieving works: 100%|██████████| 233/233 [05:31<00:00,  1.42s/it]
Retrieving works: 100%|██████████| 30/30 [00:37<00:00,  1.26s/it]
References retrieved: 233
PDF files downloaded: 74
Related works retrieved: 30
PDF files downloaded: 12

As described above, we can save the metedata to disk. In addition, we can set enable_selenium=True and is_headless=False to enable the Selenium WebDriver with Chrome in standard mode, which will allow us to retrieve more PDF files. This additional functionality requires the function to be run in an environment with the Google Chrome Browser installed (e.g, in a virtual machine or on your local computer). Therefore, it will not work in the Colab environment. Also note that PDF files of articles which are not open access are not downloaded.

references, _ = get_works(references_ids, email=EMAIL, pdf_output_dir="./pdfs", persist_dir="./cache", entry_type="reference of primary entry", enable_selenium=True, is_headless=False, show_progress=True)
related_works, _ = get_works(related_works_ids, email=EMAIL, pdf_output_dir="./pdfs", persist_dir="./cache", entry_type="related to primary entry", enable_selenium=True, is_headless=False, show_progress=True)
print(f"References retrieved: {len(references)}\nPDF files downloaded: {len([work for work in references if work['pdf_path'] is not None])}")
print(f"Related works retrieved: {len(related_works)}\nPDF files downloaded: {len([work for work in related_works if work['pdf_path'] is not None])}")

Output:

Retrieving works: 100%|██████████| 233/233 [11:49<00:00,  3.05s/it]
Retrieving works: 100%|██████████| 30/30 [00:29<00:00,  1.03it/s]
References retrieved: 233
PDF files downloaded: 147
Related works retrieved: 30
PDF files downloaded: 12

Finally, we can print the total number of works retrieved, which of them are open access, and the total number of PDF files downloaded.

total_works = works + citations + references + related_works
print("Total number of works retrieved:", len(total_works))
print("Total number of open access works:", len([work for work in total_works if work['metadata']['open_access']['is_oa']]))
print("Total number of PDF files downloaded:", len([work for work in total_works if work['pdf_path'] is not None]))

Output:

Total number of works retrieved: 488
Total number of open access works: 386
Total number of PDF files downloaded: 312

To access the status messages of works where a PDF file could not be retrieved, we can use the following code snippet:

for work in total_works:
    if work['pdf_path'] is None:
        print(f"Title: {work['metadata']['title'][:80]}...\nStatus messages: {work['status_messages']}\nDOI: {work['metadata']['ids'].get('doi', 'None')}\n")

Output:

Title: Autoimmune pathways in mice and humans are blocked by pharmacological stabilizat...
Status messages: 2024-08-14: PDF download from https://stm.sciencemag.org/content/scitransmed/11/502/eaaw1736.full.pdf using Selenium with headless mode set to False failed. 
DOI: https://doi.org/10.1126/scitranslmed.aaw1736

Title: Mendelian susceptibility to mycobacterial disease: recent discoveries...
Status messages: 2024-08-14: PDF URL not found in API call response. Skipped PDF download. 
DOI: https://doi.org/10.1007/s00439-020-02120-y

...

Title: Workflow Analysis using Graph Kernels....
Status messages: 2024-08-14: Successfully retrieved metadata with UID W2182707996. 2024-08-14: Work with UID https://openalex.org/W2182707996 is not open access or 'best_oa_location' key not found. Skipped PDF download. 
DOI: None

Title: Automating Radiologist Workflow, Part 2: Hands-Free Navigation...
Status messages: 2024-08-14: Successfully retrieved metadata with UID W2029380707. 2024-08-14: Work with UID https://openalex.org/W2029380707 is not open access or 'best_oa_location' key not found. Skipped PDF download. 
DOI: https://doi.org/10.1016/j.jacr.2008.05.012

Be sure to check out the accompanying Jupyter notebook, which also includes a bonus feature to visualize open access statistics for the retrieved works.

Wrapping up

In this tutorial, we explored using Python utility functions to interact with the OpenAlex API for retrieving full-text articles and leveraging citation data. Key points include:

Retrieving metadata and downloading PDF files using OpenAlex IDs, PMIDs, or DOIs.
Obtaining citations, references, and related works for articles.
Persisting metadata and automating PDF downloads with Selenium WebDriver.

We demonstrated the efficiency of this approach by automating the download of 312 PDF files out of 386 open access works, from a total of 488 works retrieved. Key takeaways:

Subscriptions are needed for non-open access content.
Use Unpaywall for open access versions of articles not automatically downloaded.
Check the status_messages field for information on unretreived full-text content.
Google Colab users should download data before closing sessions.
PDF files are named using the convention: {PMID}{DOI}{OpenAlex ID}.pdf.

These utility functions provide a foundation for automating full-text article retrieval and metadata collection. Future tutorials will explore text analysis, information extraction, and integration with language models.

If you encounter any bugs in the code, have suggestions for improvements, or would like to request new features, please submit an issue at my GitHub repo. Your feedback is valuable for improving these tools for the research community.

Home

Building With AI Coding Agents While Keeping Your Data Safe

My Motivation: Security That Doesn’t Kill Productivity

Understanding the Risk: The Lethal Trifecta

The Security Concept: Defense-in-Depth

1. Network Firewall: Blocking Unauthorized Exfiltration

2. Content Segregation: Trusted vs. Untrusted

3. Human Review: The Critical Layer

4. Tool Selection: Use Less Powerful Tools When Possible

What’s Actually in the Template

AI Development Tools

Development Productivity Tools

Development Environments

Infrastructure

Agent instructions

Weaknesses: Let’s Be Honest About Limitations

Conclusion: An Invitation to Explore

Reference and Additional Resources

Simon Willison’s weblog content on prompt injection

AI Development Tools

Enhancement and Productivity Tools

Open Source Scripts Adapted for This Template

Set up a Local LLM-powered Research Assistant for Health & Life Sciences

Getting started

Function / tool calling

Use cases

1. Extracting information and summarizing content from local files

2. Retrieving citation metrics from an external source, such as the OpenAlex API

Customizing the assistant for your needs

1. Limiting or replacing the core functions

2. Adding new tools

3. Implementing more complex logic and agentic workflows (for advanced users)

Wrapping up

Streamlining Full-Text Article Retrieval for Research

Installation and setup

Basic usage

Download PDF files

Persist & load metadata

Get citations

Get references and related works

Wrapping up