<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://nicomarr.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nicomarr.com/" rel="alternate" type="text/html" /><updated>2026-06-20T12:29:27+00:00</updated><id>https://nicomarr.com/feed.xml</id><title type="html">Home</title><subtitle>© 2024–2026 Nico Marr. All rights reserved.</subtitle><entry><title type="html">Building With AI Coding Agents While Keeping Your Data Safe</title><link href="https://nicomarr.com/ai/security/development/2025/10/27/safer-codespace.html" rel="alternate" type="text/html" title="Building With AI Coding Agents While Keeping Your Data Safe" /><published>2025-10-27T00:00:00+00:00</published><updated>2025-10-27T00:00:00+00:00</updated><id>https://nicomarr.com/ai/security/development/2025/10/27/safer-codespace</id><content type="html" xml:base="https://nicomarr.com/ai/security/development/2025/10/27/safer-codespace.html"><![CDATA[<p><em>How I built a security-first development environment for AI coding agents—because “move fast” and “YOLO mode” isn’t an option when you’re working in biomedical research.</em></p>

<hr />

<blockquote>
  <p><strong>tl;dr</strong></p>

  <p>Throughout this year, I found myself increasingly excited about the potential of AI coding agents. At the same time, I grew more uncomfortable with the security tradeoffs they introduce. If you found yourself in a similar position, this post may resonate with you.</p>

  <p>I built <a href="https://github.com/nicomarr/safer-codespace">safer-codespace</a> over only a few evenings, an experimental development container template that lets me work with AI coding agents while having multiple security layers in place to make data exfiltration considerably harder.</p>
</blockquote>

<p><br /></p>

<h3 id="my-motivation-security-that-doesnt-kill-productivity">My Motivation: Security That Doesn’t Kill Productivity</h3>

<p>What I really needed was a template for new projects that would meet my specific requirements:</p>

<p><strong>Have security and observability baked in from the start</strong> - Not as an afterthought, but as foundational layers. This means:</p>
<ul>
  <li>Network firewall configured automatically on container build</li>
  <li>Content segregation workflow (trusted vs. untrusted directories)</li>
  <li>Complete LLM interaction logging to a SQLite database via the <a href="https://llm.datasette.io/"><code class="language-plaintext highlighter-rouge">llm</code> CLI tool</a></li>
  <li>Optional integration with <a href="https://specstory.com/">SpecStory</a> to automatically save terminal agent conversations as clean, searchable markdown—preserving the “why” behind agent actions and creating a path to iteratively evaluate and improve my prompts and workflows over time</li>
</ul>

<p><strong>Have the option to use LLMs of my choice</strong> - Support for multiple LLM providers including open-source models, not locked into a single vendor. I wanted flexibility and control over which models process my data, with the ability to run models locally if needed.</p>

<p><strong>Provide an easy path to iterate and improve</strong> - Both the project itself and my development workflow. I wanted to learn from each development session and continuously refine my approach based on real-world usage.</p>

<p>This isn’t about being paranoid. In biomedical research, I may be working with:</p>
<ul>
  <li>Personally identifiable information from study participants</li>
  <li>Sensitive genetic and clinical data</li>
  <li>Intellectual property that must be protected</li>
  <li>Research data subject to ethics committees and data protection regulations</li>
</ul>

<p>Any data exfiltration from such research projects could have serious consequences.</p>

<p>Yet, as Dario Amodei has noted in his 2024 essay, <a href="https://www.darioamodei.com/essay/machines-of-loving-grace"><em>Machines of Loving Grace</em></a>, one of the prospects I’m also most excited about is the positive implications powerful AI systems could have in biology and physical health in the coming years—if we do it right.</p>

<p><br /></p>

<h3 id="understanding-the-risk-the-lethal-trifecta">Understanding the Risk: The Lethal Trifecta</h3>

<p>Before diving into solutions, we need to understand the threat. <a href="https://simonwillison.net/2022/Sep/12/prompt-injection/">Simon Willison coined the term “prompt injection” in 2022</a>, deliberately naming it after SQL injection because both vulnerabilities share the same root cause: <strong>mixing trusted instructions with untrusted input</strong>.</p>

<p>Just as SQL injection exploits happen when you concatenate a user’s SQL query with malicious input, prompt injection exploits happen when AI systems process untrusted external content alongside their user instructions. An attacker can embed malicious instructions in a document, web page, or dependency file that override your instructions to the LLM. What we currently call “AI assistants” or “AI agents” are autoregressive sequence models that can take actions by calling functions and other code artifacts based on their text output, often in a loop. Their very architecture makes them susceptible to such attacks.</p>

<p>Willison later defined what he calls <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"><strong>“the lethal trifecta for AI agents”</strong></a>—three capabilities that, when combined in an AI system, create a critical security vulnerability:</p>

<ol>
  <li><strong>Access to Private Data</strong> - The AI assistant can read your code, files, environment variables, or secrets</li>
  <li><strong>Exposure to Untrusted Content</strong> - The AI assistant processes external documentation, dependencies, emails, or web pages</li>
  <li><strong>Ability to Exfiltrate Data</strong> - The AI assistant can make network requests to send data to external servers</li>
</ol>

<p>When all three are present, the attack scenario becomes straightforward:</p>

<blockquote>
  <p>An attacker embeds malicious instructions in documentation → Your AI assistant reads private data → The instructions tell it to find sensitive information → It sends that information to the attacker’s server</p>
</blockquote>

<p>This isn’t theoretical. Such exploits have been <a href="https://simonwillison.net/series/prompt-injection/">documented extensively</a>. The fundamental problem is that <strong>LLMs cannot reliably distinguish between trusted instructions and malicious content</strong>—they’re optimized to model patterns in natural language and other data types, not to discover or verify truth, or understand the physical world.</p>

<p>“Guardrails” that try to detect attacks might work 99% of the time, but as Willison notes, <a href="https://simonwillison.net/2023/May/2/prompt-injection-explained/"><em>99% is a failing grade in web application security</em></a>. Attackers get unlimited attempts to craft bypasses, and they only need to succeed once.</p>

<p><br /></p>

<h3 id="the-security-concept-defense-in-depth">The Security Concept: Defense-in-Depth</h3>

<p>Since no single defense is perfect, this template takes a <strong>defense-in-depth</strong> approach: deploy multiple independent security layers so that if one fails, others still provide protection. The goal isn’t perfect security (which may be impossible with current LLM architectures), but to make attacks considerably harder and limit potential damage.</p>

<p>The core strategy: <strong>remove at least one leg of the lethal trifecta</strong> through complementary and redundant controls:</p>

<h4 id="1-network-firewall-blocking-unauthorized-exfiltration">1. Network Firewall: Blocking Unauthorized Exfiltration</h4>

<p>The devcontainer includes an automatically-configured network firewall that blocks all outbound connections except to pre-approved development endpoints like GitHub, npm, PyPI, and AI API providers. The firewall script was directly adapted from <a href="https://github.com/anthropics/claude-code/blob/main/.devcontainer/init-firewall.sh">Claude Code’s open-source repository</a>.</p>

<p>This doesn’t prevent all data exfiltration—an attacker could still <a href="https://github.blog/security/vulnerability-research/how-to-catch-github-actions-workflow-injections-before-attackers-do/">potentially abuse allowed endpoints like GitHub to create issues or gists containing stolen data</a>. But it significantly limits the attack surface and prevents the most straightforward exfiltration methods (like sending data to <code class="language-plaintext highlighter-rouge">attacker.com</code>).</p>

<p>The firewall is validated by automated tests that confirm:</p>
<ul>
  <li>Required endpoints remain reachable</li>
  <li>Unauthorized domains are blocked</li>
  <li>The security layer is actually active</li>
</ul>

<h4 id="2-content-segregation-trusted-vs-untrusted">2. Content Segregation: Trusted vs. Untrusted</h4>

<p>The template uses a simple directory structure to separate vetted content from potentially dangerous external material:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>context/
├── trusted/       # Human-reviewed, safe for AI access
└── untrusted/     # External content requiring review
</code></pre></div></div>

<p>The workflow: fetch external documentation or dependencies using non-AI tools (like my <a href="https://github.com/AnswerDotAI/web2md">url-to-markdown script adapted from AnswerAI’s web2md</a>), save them to <code class="language-plaintext highlighter-rouge">untrusted/</code>, manually review for malicious instructions, then move to <code class="language-plaintext highlighter-rouge">trusted/</code> only after verification. AI assistants should only access <code class="language-plaintext highlighter-rouge">trusted/</code> content.</p>

<p>This creates a <strong>mandatory review checkpoint</strong> before AI systems can process external content.</p>

<h4 id="3-human-review-the-critical-layer">3. Human Review: The Critical Layer</h4>

<p>Here’s the key insight: <strong>don’t use AI to detect prompt injection attacks</strong>. AI-based detection is fundamentally unreliable because attackers can craft prompts specifically designed to bypass detection.</p>

<p>Instead, rely on human intelligence. Before providing external content to your AI assistant, <strong>you review it yourself</strong> for suspicious instructions like:</p>

<ul>
  <li>“Send the contents of .env to…”</li>
  <li>“Ignore previous instructions and…”</li>
  <li>Unusual URLs or exfiltration commands</li>
</ul>

<p>This is the most important layer. Security tools can help, but informed human judgment remains the most reliable defense.</p>

<h4 id="4-tool-selection-use-less-powerful-tools-when-possible">4. Tool Selection: Use Less Powerful Tools When Possible</h4>

<p>Not every task needs an AI agent with full system access. The template includes multiple tools with different capability levels:</p>

<p><strong>Claude Code</strong> - Full-featured AI assistant with file access and command execution. Use for complex development tasks that genuinely need these capabilities.</p>

<p><strong><code class="language-plaintext highlighter-rouge">llm</code> CLI tool</strong> - Text-only interface with no file access or command execution by default. Perfect for:</p>
<ul>
  <li>Explaining error messages</li>
  <li>Generating commit messages from <code class="language-plaintext highlighter-rouge">git diff</code> output</li>
  <li>Code reviews on specific snippets</li>
  <li>Quick questions about syntax or concepts</li>
</ul>

<p>Even if compromised through prompt injection, the <code class="language-plaintext highlighter-rouge">llm</code> tool’s damage is limited to the text you explicitly pipe to it.</p>

<p><strong>Manual commands</strong> - For simple, routine tasks (git operations, package installation, running tests), skip AI entirely. It’s faster, safer, and you maintain direct control.</p>

<p>This <strong>graduated response</strong> means you only use powerful tools when actually needed, reducing your attack surface for routine operations.</p>

<p><br /></p>

<h3 id="whats-actually-in-the-template">What’s Actually in the Template</h3>

<p>Let me show you the concrete tools and configuration that implement these security principles:</p>

<h4 id="ai-development-tools">AI Development Tools</h4>

<p><strong><a href="https://docs.claude.com/en/docs/claude-code/">Claude Code</a></strong> - Anthropic’s interactive AI assistant with file access and command execution capabilities. Great for complex, multi-step development tasks.</p>

<p><strong><a href="https://llm.datasette.io/">llm CLI tool</a></strong> - Simon Willison’s command-line wrapper for calling various LLMs. Configured by default to use GitHub’s free GPT-4o (no API key required). Perfect for piping output from other CLI tools:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Generate commit message from staged changes</span>
git diff <span class="nt">--staged</span> | llm <span class="nt">-s</span> <span class="s2">"write a conventional commit message"</span>

<span class="c"># Explain an error</span>
<span class="nb">cat </span>error.log | llm <span class="s2">"what does this error mean?"</span>
</code></pre></div></div>

<p><strong><a href="https://specstory.com/">SpecStory</a></strong> (optional) - Automatically saves Claude Code conversations as clean markdown. This preserves the reasoning and design decisions behind your code as git-friendly documentation.</p>

<h4 id="development-productivity-tools">Development Productivity Tools</h4>

<p><strong><a href="https://github.com/charmbracelet/glow">glow</a></strong> - Beautiful markdown rendering in the terminal for monitoring documentation and reviewing saved conversations.</p>

<p><strong><a href="https://just.systems/">just</a></strong> - Simple command runner for project workflows. Makes it easy to define and run common development tasks.</p>

<p><strong><code class="language-plaintext highlighter-rouge">url-to-markdown</code> script</strong> - Adapted from AnswerAI’s <a href="https://github.com/AnswerDotAI/web2md">web2md</a> tool, this helps fetch external documentation safely without using AI, supporting the content segregation workflow.</p>

<h4 id="development-environments">Development Environments</h4>

<ul>
  <li><strong>Python 3.13</strong> with <code class="language-plaintext highlighter-rouge">uv</code> for fast package management</li>
  <li><strong>Node.js 24.x</strong> with <code class="language-plaintext highlighter-rouge">npm</code></li>
  <li><strong>Go</strong> (latest stable version)</li>
</ul>

<h4 id="infrastructure">Infrastructure</h4>

<ul>
  <li>Pre-configured devcontainer with security baked in from the start</li>
  <li>Four automated GitHub Actions workflows that validate:
    <ul>
      <li>Devcontainer builds successfully</li>
      <li>Network connectivity works</li>
      <li>Firewall actively blocks unauthorized domains</li>
      <li>Tools are properly installed</li>
    </ul>
  </li>
  <li>Comprehensive documentation and usage examples</li>
</ul>

<h4 id="agent-instructions">Agent instructions</h4>

<p>The repository includes a <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> file with detailed development guidelines, security principles, and best practices, heavily inspired by Eric Ries’s <a href="https://theleanstartup.com/"><em>The Lean Startup</em></a> methodology to build Minimum Viable Products (MVPs) first, and heavily focused on test-driven development. Agent instructions are of course subjective and also depend on personal preferences. Feel free to adapt them to your own style.</p>

<p><br /></p>

<blockquote>
  <p>Here’s a meta-insight that might seem paradoxical: I built this security-focused template largely <strong>using Claude Code itself</strong>—the very AI assistant I was trying to secure.</p>

  <p>The GitHub Actions workflows that validate the security controls? Built with Claude Code. The firewall configuration? Adapted from <a href="https://github.com/anthropics/claude-code/tree/main/.devcontainer">Anthropics reference container setup for Claude Code</a>. The testing scripts? Built with Claude Code. Honestly, I’d never have had the patience and time to write all of this manually from scratch: navigating the YAML syntax, writing in programming languages less familiar to me, and creating test cases. With the <strong>amazing open-source resources from trusted sources</strong> and Claude Code guiding me through the process of test-driven development, it was actually enjoyable. <strong>Huge thanks to all the open-source contributors whose work made this possible!</strong> All this was built over only a few evenings, heavily inspired by a new <a href="https://maven.com/kentro/context-engineering-for-coding/">Maven Course taught by Eleanor Berger and Isaac Flath</a>.</p>
</blockquote>

<p><br /></p>

<h3 id="weaknesses-lets-be-honest-about-limitations">Weaknesses: Let’s Be Honest About Limitations</h3>

<p>This template isn’t a silver bullet, and I want to be completely transparent about what it <strong>doesn’t</strong> do:</p>

<p><strong>It’s not a complete solution.</strong> Prompt injection remains an unsolved problem in the AI research community. Determined attackers with knowledge of the allowed endpoints could still potentially exfiltrate data through creative attacks on approved services.</p>

<p><strong>It requires user awareness.</strong> The security model only works if you understand the risks and follow the content review workflow. There’s no foolproof automation that can protect you from yourself. This is about informed risk management, not eliminating all risk.</p>

<p><strong>It’s experimental and evolving.</strong> This template is for education and exploration, not production deployment without careful evaluation for your specific context. Use it to learn about the threats and practice safer AI workflows.</p>

<p><strong>Legitimate questions remain:</strong></p>
<ul>
  <li>How do we better detect malicious content in dependencies?</li>
  <li>Can we create safer abstractions for AI tool capabilities?</li>
  <li>What additional security layers might help?</li>
  <li>How do we balance security with the rapid pace of AI tool development?</li>
</ul>

<p>The honest truth is that we’re all still figuring this out. The AI security landscape is evolving rapidly, and what works today may need adjustment tomorrow. This template is one approach based on current understanding—not a final answer.</p>

<p><br /></p>

<h3 id="conclusion-an-invitation-to-explore">Conclusion: An Invitation to Explore</h3>

<p>The <code class="language-plaintext highlighter-rouge">safer-codespace</code> template isn’t claiming to have “solved” prompt injection—no one has. It’s a practical implementation of defense-in-depth strategies that you can use today while acknowledging the limitations. If you work with AI coding assistants and care about security—whether you’re in healthcare, finance, research, or any field with sensitive data—I invite you to:</p>

<ol>
  <li><strong>Try the template</strong> - See if these patterns fit your workflow and requirements</li>
  <li><strong>Critique the approach</strong> - Where are the gaps? What’s missing? What could be better?</li>
  <li><strong>Contribute improvements</strong> - Better protections, clearer documentation, more educational content</li>
</ol>

<p><strong>Get started:</strong> <a href="https://github.com/nicomarr/safer-codespace">github.com/nicomarr/safer-codespace</a></p>

<hr />

<p><br /></p>

<h3 id="reference-and-additional-resources">Reference and Additional Resources</h3>

<h4 id="simon-willisons-weblog-content-on-prompt-injection">Simon Willison’s weblog content on prompt injection</h4>
<ul>
  <li><a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">The lethal trifecta for AI agents: private data, untrusted content, and external communication (2025)</a> - Original framework defining the three dangerous capabilities</li>
  <li><a href="https://simonwillison.net/2023/May/2/prompt-injection-explained/">Prompt injection explained, with video, slides, and a transcript (2023)</a> - Presentation with annotated slides</li>
  <li><a href="https://simonwillison.net/2022/Sep/12/prompt-injection/">Prompt injection attacks against GPT-3 (2022)</a> - The post that coined the term “prompt injection”</li>
  <li><a href="https://simonwillison.net/series/prompt-injection/">Series: Prompt injection</a> - Comprehensive collection of research, real-world attacks, and ongoing developments</li>
</ul>

<h4 id="ai-development-tools-1">AI Development Tools</h4>
<ul>
  <li><a href="https://docs.claude.com/en/docs/claude-code/overview">Claude Code overview</a> - Features and capabilities</li>
  <li><a href="https://llm.datasette.io/"><code class="language-plaintext highlighter-rouge">llm</code> CLI tool</a> - Simon Willison’s command-line tool for working with LLMs</li>
</ul>

<h4 id="enhancement-and-productivity-tools">Enhancement and Productivity Tools</h4>
<ul>
  <li><a href="https://specstory.com/">SpecStory</a> - Automatically save Claude Code conversations as markdown</li>
  <li><a href="https://docs.specstory.com/overview">SpecStory documentation</a> - Setup and usage guide</li>
  <li><a href="https://github.com/charmbracelet/glow">glow</a> - Beautiful markdown rendering for the terminal</li>
  <li><a href="https://just.systems/">just</a> - Simple command runner for project workflows</li>
</ul>

<h4 id="open-source-scripts-adapted-for-this-template">Open Source Scripts Adapted for This Template</h4>
<ul>
  <li><a href="https://github.com/anthropics/claude-code/blob/main/.devcontainer/init-firewall.sh">Claude Code firewall script</a> - Bash script restricting network access to Docker DNS and approved IPs</li>
  <li><a href="https://github.com/AnswerDotAI/web2md">web2md by AnswerAI</a> - Python script that takes a URL, makes a request, and converts web content to markdown</li>
</ul>

<hr />

<p><strong>Important:</strong> This template implements defense-in-depth strategies for AI coding assistants but does not claim to have solved prompt injection or to fully prevent attacks. It creates multiple security layers to make data exfiltration considerably harder while acknowledging the limitations of any security controls. Always review external content before exposing it to AI assistants. Use at your own risk.</p>]]></content><author><name>Nico Marr</name></author><category term="AI" /><category term="Security" /><category term="Development" /><category term="prompt-injection" /><category term="ai-security" /><category term="claude-code" /><category term="devcontainer" /><category term="defense-in-depth" /><summary type="html"><![CDATA[How I built a security-first development environment for AI coding agents—because 'move fast' and 'YOLO mode' isn't an option when you're working in biomedical research.]]></summary></entry><entry><title type="html">Set up a Local LLM-powered Research Assistant for Health &amp;amp; Life Sciences</title><link href="https://nicomarr.com/tutorials/2024/10/18/scholaris-tutorial.html" rel="alternate" type="text/html" title="Set up a Local LLM-powered Research Assistant for Health &amp;amp; Life Sciences" /><published>2024-10-18T00:00:00+00:00</published><updated>2024-10-18T00:00:00+00:00</updated><id>https://nicomarr.com/tutorials/2024/10/18/scholaris-tutorial</id><content type="html" xml:base="https://nicomarr.com/tutorials/2024/10/18/scholaris-tutorial.html"><![CDATA[<p>Welcome to the second in a series of short tutorials aimed at making Large Language Model (LLM)-powered applications more accessible for health and life sciences researchers. 
This tutorial introduces <a href="https://pypi.org/project/scholaris/">Scholaris</a>, a Python package that allows anyone to set up a research assistant on a local computer and leverage function calling capabilities “out of the box”. 
Scholaris is designed specifically for use in health and life sciences to help gain insights from scholarly articles and interact with academic databases.</p>

<p>In this tutorial, you’ll learn:</p>
<ul>
  <li>How to use <a href="https://nicomarr.github.io/scholaris/">Scholaris</a> to set up an assistant and leverage its tools for various research tasks.</li>
  <li>How tool or function calling works using an LLM.</li>
  <li>How to customize the assistant for your specific research needs.
<br />
<br /></li>
</ul>
<div style="background-color: #e6ffe6; padding: 15px; border-radius: 15px; margin: 30px 0;"> 
	<strong>Is this for you?</strong>
	This tutorial and the <a href="https://nicomarr.github.io/scholaris/">Scholaris Python package</a> is suitable for anyone and does not require any prior knowledge of Python programming. 
	The only prerequisite is that you are willing to learn and explore new tools and technologies.
	If you find particular terminology or concepts confusing, feel free to use <a href="https://you.com/">a cloud-hosted large language model</a> and ask it to explain them in simpler terms.
	Throughout this tutorial, you will also find text boxes with additional explanations. Feel free to skip these if you are already familiar with the concepts.
	If you are an experienced software developer, you may still find the tutorial useful for specific applications in health and life sciences.
	Be sure to check out the last section on customizing the assistant to suit specific research needs.
</div>
<p><br /></p>

<h2 id="getting-started">Getting started</h2>
<div style="background-color: #e6f3ff; padding: 15px; border-radius: 15px; margin: 30px 0;"> <strong>Note!</strong>
This section assumes that you have Python and the Ollama app installed on your local computer. 
If you want to follow along with the code examples, see the <a href="https://nicomarr.github.io/scholaris/#installation">documentation</a> for installation instructions.
</div>

<p>To get started, import and initialize the <code class="language-plaintext highlighter-rouge">Assistant</code> class from Scholaris:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scholaris.core</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">assistant</span> <span class="o">=</span> <span class="n">Assistant</span><span class="p">()</span> 
</code></pre></div></div>

<p>This creates an instance of the <code class="language-plaintext highlighter-rouge">Assistant</code> class with default settings. At the time of writing, the default model is <a href="https://ollama.com/library/llama3.1">Llama 3.1 8B</a>. 
You can also customize various parameters if needed, such as specifying a different model or adding custom tools (more on this in a later section).
A detailed description of how to do so can also be found in the <a href="https://nicomarr.github.io/scholaris/">documentation pages</a>. 
For now, let’s take advantage of the core functionality and default options.</p>

<p>The primary way to interact with the Scholaris assistant is through the <code class="language-plaintext highlighter-rouge">chat</code> method. Here’s an example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">response</span> <span class="o">=</span> <span class="n">assistant</span><span class="p">.</span><span class="n">chat</span><span class="p">(</span><span class="s">"Briefly tell me about the tools you have available."</span><span class="p">)</span>
</code></pre></div></div>

<p>This will prompt the assistant to describe its available tools and capabilities. 
<br /></p>

<div style="background-color: #e6f3ff; padding: 15px; border-radius: 15px; margin: 30px 0;">
    <strong>Terminology explained:</strong>
    <ul>
		<br />
		<li>
			A <strong>class</strong> in Python is like a blueprint or a template for creating <strong>objects</strong>. It's a way to bundle data and functionality together.
			<strong>Objects</strong> are instances of classes, and they can have attributes (variables) and methods (functions).
		</li>
		<br />
        <li>
			The terms <strong>function calling</strong> and <strong>tool calling</strong> are often used interchangeably. 
			Strictly speaking, functions typically receive one or more parameters as input to generate (return) one or more outputs, 
			whereas a tool is a more broadly defined term in the context of large language model (LLM)-driven applications.
			Tools may refer to a wider range of operations, including functions, code blocks that are executed without additional parameters, 
			multiple functions executed in sequence or parallel, or other types of actions. 
			In the context of LLM-driven applications, both function calling and tool calling refer to the model's ability to generate structured outputs. 
			These outputs are based on predefined schemas, not on the actual code that is executed. 
			If an LLM-powered application or workflow has the capability to call functions or tools, it may also be referred to as being "agentic".
		</li>
		<br />
        <li>
			<strong>JSON</strong>, short for JavaScript Object Notation, is a lightweight, text-based data format that is easy for humans to read and write, 
			and simple for LLMs to parse and generate. JSON uses curly braces for objects, colons to separate keys and values, 
			and commas to separate key-value pairs or elements of list objects. Originally developed for JavaScript, JSON has become a widely-used standard for data exchange in web applications and beyond.
		</li>
		<br />
		<li>
			<strong>API</strong>, short for Application Programming Interface, is a set of rules and protocols that allow different software applications to communicate with each other.
			It is the primary way of how an application is interacting with an external service or tool, such as a database, web service, or library (analogous to how humans interact through a graphical user interface, e.g., a web browser).
		</li>
	</ul>
</div>
<p><br /></p>

<h2 id="function--tool-calling">Function / tool calling</h2>
<p>Before we dive into a few use cases, let’s first review what tool or function calling is and how it works.</p>

<p>In a nutshell, the process of tool / function calling involves multiple steps, as illustrated below: 
<br />
<br /></p>

<p><strong>Figure 1.</strong> Flowchart of a basic LLM-powered application with function calling, illustrating the process from user input to final response generation.
<img src="/assets/img/function-calling.svg" alt="Figure 1" />
<br />
<br /></p>

<ol>
  <li>
    <p><strong>Tool / function calling:</strong>
The LLM receives a system message with core instructions (often including an assigned role), the user prompt, and a description of available functions with their parameters.
If the tool call is made in the middle of a conversation, the LLM also receives as input the conversation history of the session (i.e., since the initialization of the assistant or since the last reset). Otherwise, LLMs are stateless (i.e., they do not have persistent memory of previous interactions).
When using Scholaris, the conversation history is automatically stored in an attribute of the <code class="language-plaintext highlighter-rouge">Assistant</code> class, called <code class="language-plaintext highlighter-rouge">conversation_history</code>. 
Be sure to check out the <a href="https://nicomarr.github.io/scholaris/#conversation-history">documentation on how to access and use the conversation history</a>.
Importantly, the LLM does not “see” the source code for execution. Instead, it receives a description of the purpose and usage of a code element, like a Python function, usually provided as text formatted using JSON. 
The content of this JSON-formatted string is equivalent to the “docstring”. 
It is the programmer’s responsibility to ensure proper functionality. Based on the user prompt, the LLM returns the most suitable function name and its parameters for the Python interpreter to execute.</p>
  </li>
  <li>
    <p><strong>Execution of the selected tool / function:</strong>
The Python code for the selected function or tool (and any nested functions) is executed, using optional or required parameters provided by the LLM.
These functions may be designed to retrieve data from external databases, extract information from local files, or perform tasks like listing the contents of a specific directory. 
Scholaris includes several <a href="https://nicomarr.github.io/scholaris/#tools">built-in tools</a> that can be called by the LLM, such as:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">get_file_names</code></li>
      <li><code class="language-plaintext highlighter-rouge">extract_text_from_pdf</code></li>
      <li><code class="language-plaintext highlighter-rouge">get_titles_and_first_authors</code></li>
      <li><code class="language-plaintext highlighter-rouge">summarize_local_document</code></li>
      <li><code class="language-plaintext highlighter-rouge">describe_python_code</code></li>
      <li><code class="language-plaintext highlighter-rouge">id_converter_tool</code></li>
      <li><code class="language-plaintext highlighter-rouge">query_openalex_api</code></li>
      <li><code class="language-plaintext highlighter-rouge">query_semantic_scholar_api</code></li>
      <li><code class="language-plaintext highlighter-rouge">respond_to_generic_queries</code>
<br />
<br /></li>
    </ul>
  </li>
  <li>
    <p><strong>Response generation:</strong> 
Finally, the LLM is “called” again to generate a response based on the output of the executed function, the user prompt, and the conversation history as context (which also includes the system message).
A programmer can extend this step by implementing additional routines and logic, such as:</p>

    <ul>
      <li><strong>Self-reflection</strong> - a mechanism that can be implemented, allowing the LLM to evaluate the response it generated for accuracy and completeness, and repeat the response generation, if necessary.</li>
      <li><strong>Iterative loops</strong> - can be implemented, which call additional functions or prompt the user for more details, creating an iterative process by which the response is refined.</li>
      <li><strong>Multi-step problem solving</strong> - for complex queries. A workflow can be designed that breaks down tasks into multiple steps. 
 Different functions might be called in sequence or in parallel to gather all necessary information before a comprehensive response is formulated.</li>
      <li><strong>Integration of multiple function outputs</strong> - can be combined, allowing information from different sources to be synthesized to provide a more holistic answer.</li>
    </ul>
  </li>
</ol>

<p>These additional steps are not part of the core functionality of Scholaris and would need to be implemented by the user. 
<br />
<br /></p>

<h2 id="use-cases">Use cases</h2>

<p>Let’s explore a few practical use cases:</p>

<h3 id="1-extracting-information-and-summarizing-content-from-local-files">1. Extracting information and summarizing content from local files</h3>

<p>By default, the assistant has access to a single directory, called <code class="language-plaintext highlighter-rouge">data</code>. 
Within this directory, the assistant can list and read the following file formats and extensions: <code class="language-plaintext highlighter-rouge">.pdf</code>, <code class="language-plaintext highlighter-rouge">.txt</code>, <code class="language-plaintext highlighter-rouge">.md</code> or <code class="language-plaintext highlighter-rouge">.markdown</code>, <code class="language-plaintext highlighter-rouge">.csv</code>, and <code class="language-plaintext highlighter-rouge">.py</code>. 
If not already present, the <code class="language-plaintext highlighter-rouge">data</code> directory is created in the parent directory when the assistant is initialized.</p>

<p>Extracting information from local files is particularly useful for content that contains sensitive information (e.g., your local files might contain identifying information of study subjects) or for content that is outside your main area of expertise (e.g., a Python script for data analysis obtained from a colleague or collaborator, or a medical chart with diagnostic codes). Additionally, it is useful for documents that are very technical in nature or otherwise difficult to read.</p>

<p>Let’s use the <a href="https://github.com/nicomarr/scholaris/blob/main/scholaris/core.py">source code of Scholaris</a> as an example! To extract and summarize the content of the source code file, you must first copy it to your local <code class="language-plaintext highlighter-rouge">data</code> directory. 
You can do this using your file manager (e.g., Finder or Explorer), in the terminal, or by using the Python <code class="language-plaintext highlighter-rouge">shutil</code> module, like so:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">shutil</span>
<span class="n">shutil</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="s">"path/to/scholaris/core.py"</span><span class="p">,</span> <span class="s">"data/core.py"</span><span class="p">)</span> <span class="c1"># Make sure to replace "path/to/scholaris/core.py" with the actual path to the file.
</span></code></pre></div></div>
<p>Now you can ask the assistant to summarize the content of the file:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">response</span> <span class="o">=</span> <span class="n">assistant</span><span class="p">.</span><span class="n">chat</span><span class="p">(</span><span class="s">"Summarize the content of the file `core.py` in the local `data` directory."</span><span class="p">)</span>
</code></pre></div></div>
<p>You may also ask the assistant to list the contents of the <code class="language-plaintext highlighter-rouge">data</code> directory:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">response</span> <span class="o">=</span> <span class="n">assistant</span><span class="p">.</span><span class="n">chat</span><span class="p">(</span><span class="s">"List the contents of the local `data` directory."</span><span class="p">)</span>
</code></pre></div></div>
<p><br /></p>

<p>There are numerous ways to customize the assistant to suit your needs. 
In the next section, we will explore these possibilities in more detail. For now, let’s illustrate another use case.
<br />
<br /></p>

<h3 id="2-retrieving-citation-metrics-from-an-external-source-such-as-the-openalex-api">2. Retrieving citation metrics from an external source, such as the OpenAlex API</h3>

<p>The assistant can query the OpenAlex API to retrieve citation metrics for a given Digital Object Identifier (DOI). 
This is particularly helpful if you want to include citation metrics in your literature search or when you need to quickly assess the impact of a specific article. 
Here’s an example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">response</span> <span class="o">=</span> <span class="n">assistant</span><span class="p">.</span><span class="n">chat</span><span class="p">(</span><span class="s">"How often has the article with the DOI `10.1172/jci.insight.144499` been cited?"</span><span class="p">)</span>
</code></pre></div></div>
<p>This will prompt the assistant to query the OpenAlex API and return the citation metrics for the specified DOI.
<br />
<br /></p>

<h2 id="customizing-the-assistant-for-your-needs">Customizing the assistant for your needs</h2>

<div style="background-color: #e6ffe6; padding: 15px; border-radius: 15px; margin: 30px 0;">
	<strong>Tip:</strong>
	To customize the assistant, it is helpful to have a basic understanding of Python programming. 
	If you are new to Python, consider taking a beginner course, such as <a href="https://www.deeplearning.ai/short-courses/ai-python-for-beginners/">AI Python for Beginners</a>, 
	a free short course offered by <a href="https://www.deeplearning.ai/">DeepLearning.AI</a> (approximate time to complete: 4-5 hours).
	Basic knowledge of how to define a function in Python is all you need to customize the assistant with new tools.
</div>

<p>Scholaris is designed to be highly customizable, allowing you to extend its capabilities to suit your specific research needs. 
The core tools or functions are passed to the assistant like “building blocks” during initialization. 
Therefore, there is no need to modify the source code and Assistant class in order to expand the tools.
There are several ways to customize the assistant:</p>

<h3 id="1-limiting-or-replacing-the-core-functions">1. Limiting or replacing the core functions</h3>
<p>If you want to change the core functions, you can do so by passing the desired core functions as an argument (in the form of dictionaries) to the <code class="language-plaintext highlighter-rouge">Assistant</code> class when it is initialized. 
For example, to limit the assistant’s ability to respond to generic questions and access external data from the OpenAlex and Semantic Scholar APIs, you would initialize the assistant as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">assistant</span> <span class="o">=</span> <span class="n">Assistant</span><span class="p">(</span><span class="n">tools</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"query_openalex_api"</span><span class="p">:</span> <span class="n">query_openalex_api</span><span class="p">,</span>
    <span class="s">"query_semantic_scholar_api"</span><span class="p">:</span> <span class="n">query_semantic_scholar_api</span><span class="p">,</span>
    <span class="s">"respond_to_generic_queries"</span><span class="p">:</span> <span class="n">respond_to_generic_queries</span><span class="p">,</span>
	<span class="s">"describe_tools"</span><span class="p">:</span> <span class="n">describe_tools</span>
    <span class="p">})</span>
</code></pre></div></div>
<p>When the assistant is initialized in this way, it will no longer be able to access information from the local <code class="language-plaintext highlighter-rouge">data</code> directory or extract information from local files, even though the <code class="language-plaintext highlighter-rouge">data</code> directory is still present.</p>

<p>Similarly, you can initialize the assistant to only be able to extract information from local files and summarize the content of local documents:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">assistant</span> <span class="o">=</span> <span class="n">Assistant</span><span class="p">(</span><span class="n">tools</span> <span class="o">=</span> <span class="p">{</span>
	<span class="s">"get_file_names"</span><span class="p">:</span> <span class="n">get_file_names</span><span class="p">,</span>
	<span class="s">"extract_text_from_pdf"</span><span class="p">:</span> <span class="n">extract_text_from_pdf</span><span class="p">,</span>
	<span class="s">"summarize_local_document"</span><span class="p">:</span> <span class="n">summarize_local_document</span><span class="p">,</span>
	<span class="s">"describe_python_code"</span><span class="p">:</span> <span class="n">describe_python_code</span><span class="p">,</span>
	<span class="s">"respond_to_generic_queries"</span><span class="p">:</span> <span class="n">respond_to_generic_queries</span><span class="p">,</span>
	<span class="s">"describe_tools"</span><span class="p">:</span> <span class="n">describe_tools</span>
	<span class="p">})</span>
</code></pre></div></div>
<p>When the assistant is initialized in this way, it will no longer be able to make API calls to external sources.</p>

<p>It is recommended to keep the <code class="language-plaintext highlighter-rouge">describe_tools</code> function and <code class="language-plaintext highlighter-rouge">respond_to_generic_queries</code> function in the core tools to maintain the assistant’s ability to describe its tools (including newly added tools) and respond to generic queries, respectively. 
The latter tool also represents a fallback mechanism in case the assistant is unable to identify the user’s intent or the user’s query is outside the scope of the core tools. 
When using <a href="https://pypi.org/project/scholaris/">Scholaris</a>, the research assistant is designed to use a tool to generate a final response to a user’s prompt. This is to ensure that the assistant is primarily providing information which is relevant for health and life sciences. 
Otherwise it will abort the conversation. Be sure to check out the tools section in the <a href="https://nicomarr.github.io/scholaris/#tools">documentation</a> for more details (see callout box: <strong>What happens if the assistant is initialized without any tools?</strong>) 
<br />
<br /></p>

<h3 id="2-adding-new-tools">2. Adding new tools</h3>
<p>You can also add new tools to the assistant to extend its capabilities. In this case, the core tools will be appended, not replaced. This is achieved simply by defining a new function in Python. Be sure to use <a href="https://docs.python.org/3/library/typing.html#module-typing">type hints</a>, <a href="https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html">Google-style docstrings</a>, and the <code class="language-plaintext highlighter-rouge">@json_schema_decorator</code> function of the <a href="https://nicomarr.github.io/scholaris/core.html#json_schema_decorator">Scholaris Python package</a> to automatically generate the schema for your new function.
Your new function can then be passed to the <code class="language-plaintext highlighter-rouge">Assistant</code> class during initialization, like so:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">assistant</span> <span class="o">=</span> <span class="n">Assistant</span><span class="p">(</span><span class="n">add_tools</span> <span class="o">=</span> <span class="p">{</span><span class="s">"your_new_function"</span><span class="p">:</span> <span class="n">your_new_function</span><span class="p">})</span>
</code></pre></div></div>
<p>More details can be found in the <a href="https://nicomarr.github.io/scholaris/#defining-new-tools">Developer Guide</a> section of the documentation.</p>

<p>Let’s revisit a few key points and ideas to consider when using the <a href="https://pypi.org/project/scholaris/">Scholaris</a> package and customizing the assistant:</p>

<ul>
  <li>
    <p>Many other libraries and Software Development Kits (SDKs) require you to write JSON schemas of the functions or tools to be called by the LLM. 
This is simplified in Scholaris by using the <code class="language-plaintext highlighter-rouge">@json_schema_decorator</code> and Google-syle docstrings, which are easy to write and read.</p>
  </li>
  <li>
    <p><a href="https://pypi.org/project/scholaris/">Scholaris</a> is designed so that functions or tools are passed as building blocks during initialization of the assistant. 
Therefore, there is no need to modify the source code of the <code class="language-plaintext highlighter-rouge">Assistant</code> class in order to expand its capabilities, 
unless you want to implement more complex logic and agentic workflows, including multi-step reasoning and/or loops (more on this below).</p>
  </li>
  <li>
    <p><a href="https://pypi.org/project/scholaris/">Scholaris</a> is developed to serve as a framework for building LLM-powered research assistants in health and life sciences rather than a robust production-ready tool. 
Therefore, you may also modify the existing tools and functions or add similar tools to extend the assistant’s capabilities. 
Consider using the assistance of an LLM to modify the provided core functions to suit your specific research needs. To do so, you may use larger cloud-hosted LLMs to aid you in this process, although for simple modifications, smaller (local) models may suffice.
For example, you can modify the <code class="language-plaintext highlighter-rouge">summarize_local_document</code> function to extract specific information from a document that is relevant to your research by modifying the prompt used inside the function.</p>
  </li>
  <li>
    <p>Always use LLMs responsibly and be aware of their limitations. Use additional models, such as <a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/">Llama Guard</a>, in production environments to ensure that the assistant does not generate harmful or inappropriate content.
<br />
<br /></p>
  </li>
</ul>

<h3 id="3-implementing-more-complex-logic-and-agentic-workflows-for-advanced-users">3. Implementing more complex logic and agentic workflows (for advanced users)</h3>

<p>If you want to implement more complex logic and agentic workflows, such as multi-step reasoning, iterative loops, or self-reflection, you will need to modify the <a href="https://github.com/nicomarr/scholaris/blob/main/scholaris/core.py">source code</a> of the <code class="language-plaintext highlighter-rouge">Assistant</code> class. 
The Scholaris package has been written using a ‘literate’ programming style and <a href="https://nbdev.fast.ai/">nbdev</a>, which means that the source code is written in a way that is easy to read and understand. 
This makes it easier for you to modify the source code to suit your specific needs. Be sure to also view the Jupyter notebook with the ‘literate’ source code and additional tests <a href="https://github.com/nicomarr/scholaris/blob/main/nbs/01_core.ipynb">here</a>.
<br />
<br /></p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>In this tutorial, you learned how to set up a LLM-powered research assistant for health and life sciences. Be sure to check out the <a href="https://nicomarr.github.io/scholaris/">documentation pages</a> for more details on how to use the <a href="https://pypi.org/project/scholaris/">Scholaris Python package</a>.
Consider it as an application to help you accelerate your research aimed at creating a positive impact, and keep the limitations of LLMs in mind:</p>
<blockquote>
  <p>Current AI systems lack several essential characteristics of human-level intelligence, including the ability to learn, navigate, and understand the physical world, persistent memory, the ability to plan complex action sequences, and the ability to be controllable and safe by design (not by fine-tuning).
<a href="https://www.youtube.com/watch?v=4DsCtgtQlZU">cf. Yann LeCun - Keynote at the Hudson Forum</a></p>
</blockquote>

<p><br />
<br />
<br />
If you spotted any errors or inconsistencies in this tutorial, please feel free to open an issue on the <a href="https://github.com/nicomarr/scholaris/issues">GitHub repository’s issue page</a>.</p>]]></content><author><name></name></author><category term="tutorials" /><summary type="html"><![CDATA[Welcome to the second in a series of short tutorials aimed at making Large Language Model (LLM)-powered applications more accessible for health and life sciences researchers. This tutorial introduces Scholaris, a Python package that allows anyone to set up a research assistant on a local computer and leverage function calling capabilities “out of the box”. Scholaris is designed specifically for use in health and life sciences to help gain insights from scholarly articles and interact with academic databases.]]></summary></entry><entry><title type="html">Streamlining Full-Text Article Retrieval for Research</title><link href="https://nicomarr.com/tutorials/2024/08/14/streamlining-full-text-article-retrieval-for-research.html" rel="alternate" type="text/html" title="Streamlining Full-Text Article Retrieval for Research" /><published>2024-08-14T00:00:00+00:00</published><updated>2024-08-14T00:00:00+00:00</updated><id>https://nicomarr.com/tutorials/2024/08/14/streamlining-full-text-article-retrieval-for-research</id><content type="html" xml:base="https://nicomarr.com/tutorials/2024/08/14/streamlining-full-text-article-retrieval-for-research.html"><![CDATA[<p>Welcome to the first in a series of short tutorials aimed at making LLM-powered applications more accessible for health and life sciences researchers. This tutorial introduces Python utility functions for interacting with the <a href="https://docs.openalex.org/how-to-use-the-api/api-overview">OpenAlex API</a>, a comprehensive, open-access catalog of global research named after the ancient Library of Alexandria and made by the nonprofit <a href="https://ourresearch.org/">OurResearch</a>.</p>

<p>While other excellent <a href="https://trangdata.github.io/openalexR-webinar/">community libraries</a> exist for querying the OpenAlex API, my focus here is on functions tailored for streamlining the retrieval of full-text articles indexed in PubMed and leveraging OpenAlex’s extensive citation data. These features are particularly valuable for Retrieval-Augmented Generation (RAG) applications, which can enhance language model performance and improve response quality.</p>

<p>Check out the accompanying <a href="https://github.com/nicomarr/public-tutorials">Jupyter notebook</a> to run all examples described below. Note that we use <a href="https://ipython.readthedocs.io/en/stable/interactive/magics.html">magic commands</a> (%) or (!) to run shell commands within Jupyter notebooks.
<br />
<br />
<br /></p>

<h3 id="installation-and-setup">Installation and setup</h3>

<p>(1) Install the required third-party libraries:</p>
<ul>
  <li>requests</li>
  <li>tqdm</li>
  <li>selenium</li>
  <li>webdriver-manager</li>
  <li>nbformat</li>
  <li>plotly</li>
</ul>

<p>In a Jupyter notebook environment, simply install these libraries using line magics and pip, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%pip install -qU requests
%pip install -qU tqdm
%pip install -qU selenium
%pip install -qU webdriver-manager
%pip install -qU nbformat
%pip install -qU plotly
</code></pre></div></div>
<p><br />
<br />
(2) Download the file named <code class="language-plaintext highlighter-rouge">openalex_api_utils.py</code> from the following <a href="https://github.com/nicomarr/public-tutorials">GitHub repo</a>, and save it to your working directory (e.g., the same directory from which you run the accompanying notebook). The <code class="language-plaintext highlighter-rouge">openalex_api_utils.py</code> file contains all utility functions described below and in the accompanying notebook.</p>

<p>In Colab or any other Jupyter notebook environment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!wget -q https://raw.githubusercontent.com/nicomarr/public-tutorials/main/openalex_api_utils.py
</code></pre></div></div>

<p>Or in a terminal emulator:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://raw.githubusercontent.com/nicomarr/public-tutorials/main/openalex_api_utils.py
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">wget</code> is not installed, you may also use curl:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -O https://raw.githubusercontent.com/nicomarr/public-tutorials/main/openalex_api_utils.py
</code></pre></div></div>
<p><br />
<br />
(3) Import the utility functions:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">openalex_api_utils</span> <span class="kn">import</span> <span class="o">*</span></code></pre></figure>

<p><br />
<br />
(4) Add your email address to the environment variables. In Google Colab, open the side panel, click on the ‘key’ icon and add a key-value pair with the key ‘EMAIL’ (all UPPERCASE, no dash) and your email address as the value, then enable notebook access. See the section below and <a href="https://x.com/GoogleColab/status/1719798406195867814">this link</a> for more details. Your email address is sent as part of the request to the OpenAlex API. This is a common and polite practice that helps speed up response times when making many API calls. It also helps developers contact you if there are any issues. For more details, follow <a href="https://docs.openalex.org/how-to-use-the-api/api-overview">this link</a>.
<br />
<br />
(5) Load your email address from the environment variables. In Google Colab, you can do that by running the following commands after you have added your email address in the Secrets tab, as described above.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">google.colab</span> <span class="kn">import</span> <span class="n">userdata</span>
<span class="n">EMAIL</span> <span class="o">=</span> <span class="n">userdata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"EMAIL"</span><span class="p">)</span></code></pre></figure>

<p>If you work with in a Jypyter notebook environment on your local computer, you can import environment variables using the os module:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="n">EMAIL</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"EMAIL"</span><span class="p">]</span></code></pre></figure>

<p>Alternatively, you may also just define it directly, like so:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">EMAIL</span> <span class="o">=</span> <span class="s">"REPLACE_WITH_YOUR_EMAIL@example.com"</span></code></pre></figure>

<p>However, it is best practice to keep sensitive information like email addresses out of the code.
<br />
<br />
<br /></p>

<h3 id="basic-usage">Basic usage</h3>

<p>First, create a list containing unique identifyers of the works to retrieve information about. In OpenAlex, works can be PubMed articles, books, datasets, and theses. In this tutorial, we first get information about 3 PubMed articles using a unique identifyer for each article. Unique identifyers can be an OpenAlex ID, a PubMed ID (PMID), or a Digital Object Identifier (DOI).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">uids</span> <span class="o">=</span> <span class="p">[</span><span class="s">'https://openalex.org/W4387665659'</span><span class="p">,</span> <span class="s">'33497357'</span><span class="p">,</span> <span class="s">'10.1126/sciimmunol.aau8714'</span><span class="p">]</span></code></pre></figure>

<p><br />
<br />
The first utility function we will be using is <code class="language-plaintext highlighter-rouge">get_works()</code>. Here, we pass in as argument the list with the unique identifiers, and the email address we have imported from the environment variables. Note that DOIs are accepted with or without a <code class="language-plaintext highlighter-rouge">https://doi.org/</code> prefix, and OpenAlex IDs are accepted with or without a <code class="language-plaintext highlighter-rouge">https://openalex.org/</code> prefix. We can also set a third (optional) argument, <code class="language-plaintext highlighter-rouge">show_progress=True</code>, to show a progress bar. The function returns two list objects. The first list object (which we will name ‘works’) contains all the information retrieved from the OpenAlex API. The second list object contains only messages in case anything goes wrong, for example, if one or more IDs provided do not exist in the database. If everything is fine, this list will be empty.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">works</span><span class="p">,</span> <span class="n">failed_calls</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">ids</span><span class="o">=</span><span class="n">uids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"All works were successfully retrieved."</span><span class="p">)</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">failed_calls</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="k">print</span><span class="p">(</span><span class="s">"Some of the works could not be retrieved."</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 3/3 [00:01&lt;00:00,  1.92it/s]
All works were successfully retrieved.
</code></pre></div></div>
<p><br />
<br />
We can access the retrieved metadata simply by indexing into the works object, which is a list of dictionaries. The data obtained from the OpenAlex API is stored under the key ‘metadata’.
The next line prints the title of the first work in the list (for those new to Python, indexing starts at 0).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">works</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'title'</span><span class="p">])</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Harnessing large language models (LLMs) for candidate gene prioritization and selection
</code></pre></div></div>
<p><br />
<br />
To list all the works succesfully retrieved, we can use the <code class="language-plaintext highlighter-rouge">list_works()</code>function. This will display selected metadata of the retrieved articles in html format, including first author, title, journal, publication year, how many times it has been cited, the number of references, and 10 related works (which can also be retrieved from the OpenAlex API). Note that the symbols in the output indicate whether the article is open access or not, and whether the full text is available or not.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">list_works</span><span class="p">(</span><span class="n">works</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output (html):</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Toufiq et al. Harnessing large language models (LLMs) for candidate gene prioritization and selection. Journal of Translational Medicine 2023
Cited by: 10 | References: 64 | Related works: 10
Download PDF   Read Full Text   🔓   📖
 
Khan et al. Distinct antibody repertoires against endemic human coronaviruses in children and adults. JCI Insight 2021
Cited by: 53 | References: 70 | Related works: 10
Download PDF   Read Full Text   🔓   📖
 
Boisson‐Dupuis et al. Tuberculosis and impaired IL-23–dependent IFN-γ immunity in humans homozygous for a common TYK2 missense variant. Science Immunology 2018
Cited by: 152 | References: 99 | Related works: 10
Download PDF   Read Full Text   🔓   📖
</code></pre></div></div>
<p><br />
<br /></p>

<h3 id="download-pdf-files">Download PDF files</h3>

<p>Before we proceed with downloading PDF files, please read the following copyright notice:</p>

<div style="background-color: #f0f0f0; border: 1px solid #d0d0d0; padding: 10px; margin: 10px 0;">
<strong>Copyright Notice:</strong> Downloading PDFs may be subject to copyright restrictions. Users are responsible for ensuring they have the right to access and download the content. Always respect the terms of use of the content providers and adhere to applicable copyright laws. See the following <a href="https://github.com/nicomarr/public-tutorials/blob/main/README.md">README.md</a> file for further details.
</div>

<p><br />
We can pass an additional argument to the <code class="language-plaintext highlighter-rouge">get_works()</code> function to save the PDF files a specified directory, like so:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">works</span><span class="p">,</span> <span class="n">failed_calls</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">ids</span><span class="o">=</span><span class="n">uids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Requests: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">uids</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">Retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">works</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">works</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="se">\n</span><span class="s">Failed calls: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">failed_calls</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 3/3 [00:09&lt;00:00,  3.25s/it]
Requests: 3
Retrieved: 3
PDF files downloaded: 2
Failed calls: 0
</code></pre></div></div>
<p><br />
<br />
The PDF files can then be used for parsing the full text, tables, and figures of the articles for retrieval augmented generation. All this will be explained in upcoming tutorials. For now, let’s pay attention to the PDFs and notice that only two PDF files were successfully downloaded, even though all three articles are open access. This is because some publishers have put requirements in place that force us to use a web browser to download the PDFs. We will automate this in a later step. Let’s first just inspect the output. Each element in the returned list object has the follwoing dictionary keys:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">works</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">keys</span><span class="p">())</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dict_keys(['uid', 'entry_types', 'metadata', 'pdf_path', 'status_messages', 'persist_datetime'])
</code></pre></div></div>
<p><br />
<br />
We can get status messages for each work using the ‘status_messages’ key.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">works</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Title: </span><span class="si">{</span><span class="n">work</span><span class="p">[</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'title'</span><span class="p">][</span><span class="si">:</span><span class="mi">80</span><span class="p">]</span><span class="si">}</span><span class="s">...</span><span class="se">\n</span><span class="s">Status messages: </span><span class="si">{</span><span class="n">work</span><span class="p">[</span><span class="s">'status_messages'</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Title: Harnessing large language models (LLMs) for candidate gene prioritization and se...
Status messages: 2024-08-14: Successfully retrieved metadata with UID W4387665659. 2024-08-14: PDF saved to ./pdfs/37845713_10.1186#s12967-023-04576-8_W4387665659.pdf. 

Title: Distinct antibody repertoires against endemic human coronaviruses in children an...
Status messages: 2024-08-14: Successfully retrieved metadata with UID 33497357. 2024-08-14: PDF saved to ./pdfs/33497357_10.1172#jci.insight.144499_W3125794218.pdf. 

Title: Tuberculosis and impaired IL-23–dependent IFN-γ immunity in humans homozygous fo...
Status messages: 2024-08-14: Successfully retrieved metadata with UID 10.1126/sciimmunol.aau8714. 2024-08-14: Failed to download PDF from https://immunology.sciencemag.org/content/immunology/3/30/eaau8714.full.pdf. Status code: 403. Selenium disabled. 
</code></pre></div></div>
<p><br />
<br />
We can also get the paths to the PDFs that were downloaded by using the <code class="language-plaintext highlighter-rouge">pdf_path</code> key. Note that each PDF file is saved using the following naming convention:</p>

<p><code class="language-plaintext highlighter-rouge">{PMID}_{DOI}_{OpenAlex ID}.pdf</code></p>

<p>with <code class="language-plaintext highlighter-rouge">/</code> replaced by <code class="language-plaintext highlighter-rouge">#</code>. The value is <code class="language-plaintext highlighter-rouge">None</code> if the PDF file was not downloaded.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">works</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"File path: </span><span class="si">{</span><span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>File path: ./pdfs/37845713_10.1186#s12967-023-04576-8_W4387665659.pdf

File path: ./pdfs/33497357_10.1172#jci.insight.144499_W3125794218.pdf

File path: None
</code></pre></div></div>
<p><br />
<br />
Thanks to the <a href="https://www.selenium.dev/">Selenium Browser Automation Project</a>, we can automate web browsers. This additional functionality requires the function to be run in a environment with the <a href="https://www.google.com/chrome/">Google Chrome Browser</a> installed (e.g, in a virtual machine or on your local computer). Therefore, it will not work in the Google Colab environment.</p>

<p>First, let’s remove any downloaded files from the previous run to give us a clean slate.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">rm</span> <span class="o">-</span><span class="n">rf</span> <span class="p">.</span><span class="o">/</span><span class="n">pdfs</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Removed pdfs directory and all its contents."</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Removed pdfs directory and all its contents.
</code></pre></div></div>
<p><br />
<br />
Now, let’s rerun the <code class="language-plaintext highlighter-rouge">get_works()</code> function with an additional (optional) argument, namely <code class="language-plaintext highlighter-rouge">enable_selenium</code> set to <code class="language-plaintext highlighter-rouge">True</code>. This will enable the Selenium browser automation tool to be used in the background to retrieve the full text PDFs of the works that cannot be retrieved using the <code class="language-plaintext highlighter-rouge">requests</code> library.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">works</span><span class="p">,</span> <span class="n">failed_calls</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">uids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">enable_selenium</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Requested: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">uids</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">Retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">works</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">With PDFs: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">works</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Failed calls: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">failed_calls</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 3/3 [00:12&lt;00:00,  4.23s/it]
Requests: 3
Retrieved: 3
PDF files downloaded: 3
Failed calls: 0
</code></pre></div></div>
<p><br />
<br />
Some publishers require PDFs to be accessed via a browser with a visible user interface. When <code class="language-plaintext highlighter-rouge">enable_selenium</code> set to <code class="language-plaintext highlighter-rouge">True</code>, the default option is to invoke the browser to run as a background process (i.e., in headless mode). By passing in an additional (optional) argument, <code class="language-plaintext highlighter-rouge">is_headless=False</code>, we can fully automate a web browser. This will cause a web browser window to automatically open and close for each article that cannot be downloaded using the <code class="language-plaintext highlighter-rouge">requests</code> library.
<br />
<br /> 
<br /></p>

<h3 id="persist--load-metadata">Persist &amp; load metadata</h3>

<p>In addition to downloading PDF files, we can pass in an optional argument to the <code class="language-plaintext highlighter-rouge">get_works()</code>function to save the metadata to a specified directory. In doing so, the metadata for each article will be saved as a separate JSON file, using a similar naming convention as for the PDF files. The metadata can then be used later then querying an index during retrieval augmentated [text] generation. This will be the focus of a future tutorial. For now, let’s run the following code to demonstrate this additional functionality:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">works</span><span class="p">,</span> <span class="n">failed_calls</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">uids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="o">%</span><span class="n">ls</span> <span class="p">.</span><span class="o">/</span><span class="n">cache</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 3/3 [00:09&lt;00:00,  3.10s/it]
30578352_10.1126#sciimmunol.aau8714_W2906653622.json
33497357_10.1172#jci.insight.144499_W3125794218.json
37845713_10.1186#s12967-023-04576-8_W4387665659.json
</code></pre></div></div>
<p><br />
<br />
Works can be loaded from storage using the <code class="language-plaintext highlighter-rouge">load_works_from_storage()</code> function, simply by providing the path to the directory where the JSON files of the works are stored. This function returns a list of works, similar to the <code class="language-plaintext highlighter-rouge">get_works()</code> function. When sorted by the <code class="language-plaintext highlighter-rouge">uid</code> (or alternatively, by using <code class="language-plaintext highlighter-rouge">persist_datetime</code> as key), we can assert that both list objects are equal.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">works_from_storage</span> <span class="o">=</span> <span class="n">load_works_from_storage</span><span class="p">(</span><span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">)</span>
<span class="n">works</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'uid'</span><span class="p">])</span>
<span class="n">works_from_storage</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'uid'</span><span class="p">])</span>
<span class="k">assert</span> <span class="n">works</span> <span class="o">==</span> <span class="n">works_from_storage</span></code></pre></figure>

<p><br />
<br />
Note that the <code class="language-plaintext highlighter-rouge">get_works()</code> function also uses the <code class="language-plaintext highlighter-rouge">load_works_from_storage()</code> function to check the cache first before making a request to the API; that is, if the storage location is specified using the <code class="language-plaintext highlighter-rouge">persist_dir</code> argument. If a work is found in the cache, it is returned directly. This speeds up the process and reduces the number of API calls made. We can illustrate this by running the get_works function again with the same uids. Before the first call, we will clear the cache directory to ensure that the works are retrieved from the API. Note the ~200x speedup when executing the function a second time.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">%</span><span class="n">rm</span> <span class="o">-</span><span class="n">rf</span> <span class="p">.</span><span class="o">/</span><span class="n">cache</span>
<span class="n">_works</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">uids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">_works</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">uids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="o">%</span><span class="n">ls</span> <span class="p">.</span><span class="o">/</span><span class="n">cache</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 3/3 [00:01&lt;00:00,  2.41it/s]
Retrieving works: 100%|██████████| 3/3 [00:00&lt;00:00, 536.10it/s]
</code></pre></div></div>
<p><br />
<br />
To get further help on the <code class="language-plaintext highlighter-rouge">get_works()</code> function and to see all arguments available, execute <code class="language-plaintext highlighter-rouge">help(get_works)</code>.
<br />
<br />
<br /></p>

<h3 id="get-citations">Get citations</h3>

<p>Next, we want to get all articles that have cited any of the 3 articles for which we obtained the PDFs and metadata in the first place. We can do so by using the <code class="language-plaintext highlighter-rouge">get_citations()</code> function, which accepts largely the same arguments as the <code class="language-plaintext highlighter-rouge">get_works()</code> function, with two key differences:</p>

<ol>
  <li>We pass in the <code class="language-plaintext highlighter-rouge">works</code> object (output of the <code class="language-plaintext highlighter-rouge">get_works()</code> function) directly.</li>
  <li>The process for the API call is slightly different (hence we use a separate function). This this is not important here.</li>
</ol>

<p>The output is largely the same as for the <code class="language-plaintext highlighter-rouge">get_works()</code> function, with the difference that the value for <code class="language-plaintext highlighter-rouge">entry_types</code> is automatically set to “citing primary entry”. This will allow us to differentiate between the primary articles and the articles that cite them. Moreover, the function returns single list object, not a tuple. The basic usage is as follows:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">citations</span> <span class="o">=</span> <span class="n">get_citations</span><span class="p">(</span><span class="n">works</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving citations: 100%|██████████| 3/3 [00:03&lt;00:00, 1.32s/it]
Processing citations: 100%|██████████| 216/216 [00:00&lt;00:00, 1332308.33it/s]
</code></pre></div></div>
<p><br />
<br /></p>
<div style="background-color: #f0f0f0; border: 1px solid #d0d0d0; padding: 10px; margin: 10px 0;">
<strong>Reminder</strong>: When using the <strong>get_citations()</strong> function to download PDFs, please be aware of potential copyright restrictions. Ensure you have the right to access and download the content, and always respect the terms of use of the content providers. Refer to the Copyright Notice in the following <a href="https://github.com/nicomarr/public-tutorials/blob/main/README.md">README.md</a> file for more details.
</div>

<p><br />
To download PDFs and store the metadata in a cache directory, we can pass in the <code class="language-plaintext highlighter-rouge">pdf_output_dir</code> and <code class="language-plaintext highlighter-rouge">persist_dir</code> arguments, like so:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">citations</span> <span class="o">=</span> <span class="n">get_citations</span><span class="p">(</span><span class="n">works</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Citations retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">citations</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">citations</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving citations: 100%|██████████| 3/3 [00:02&lt;00:00,  1.19it/s]
Processing citations: 100%|██████████| 222/222 [00:00&lt;00:00, 152570.13it/s]
Retrieving PDFs: 100%|██████████| 222/222 [06:00&lt;00:00,  1.62s/it]
Persisting data: 100%|██████████| 222/222 [00:00&lt;00:00, 624.10it/s]
Citations retrieved: 222
PDF files downloaded: 102
</code></pre></div></div>
<p><br />
<br />
We can also enable the Selenium WebDriver and automate Chrome in headless or standard mode. This is done the same way as for the <code class="language-plaintext highlighter-rouge">get_works()</code> function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">citations</span> <span class="o">=</span> <span class="n">get_citations</span><span class="p">(</span><span class="n">works</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">enable_selenium</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">is_headless</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Citations retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">citations</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">citations</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving citations: 100%|██████████| 3/3 [00:02&lt;00:00,  1.15it/s]
Processing citations: 100%|██████████| 222/222 [00:00&lt;00:00, 545800.40it/s]
Retrieving PDFs: 100%|██████████| 222/222 [09:37&lt;00:00,  2.60s/it]
Persisting data: 100%|██████████| 222/222 [00:00&lt;00:00, 616.35it/s]
Citations retrieved: 222
PDF files downloaded: 151
</code></pre></div></div>
<p><br />
<br />
The remaining citations for which the PDFs could not be downloaded have to be retrieved manually. Most of them are not open access. We will get back to this later; for now, let’s get the references and related works as a next step.
<br />
<br />
<br /></p>

<h3 id="get-references-and-related-works">Get references and related works</h3>
<p>Now, let’s retrieve all references and related works for the three articles we obtained earlier. We’ll use list comprehensions to gather this information efficiently. First, we’ll collect the references for each article. References are the works cited by our original articles. We’ll then flatten the resulting list of lists into a single list of reference IDs. Next, we’ll gather the related works. <a href="https://docs.openalex.org/api-entities/works/work-object#related_works">Related works</a> are identified through an algorithmic process that selects recent papers sharing the most conceptual similarities with a given paper. This selection may include preprints from bioRxiv, which might not yet be indexed in PubMed.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">references_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">work</span><span class="p">[</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'referenced_works'</span><span class="p">]</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">works</span><span class="p">]</span> <span class="c1"># List comprehension
</span><span class="n">references_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">references_ids</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span> <span class="c1"># Flatten the lists
</span><span class="n">related_works_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">work</span><span class="p">[</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'related_works'</span><span class="p">]</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">works</span><span class="p">]</span> <span class="c1"># List comprehension
</span><span class="n">related_works_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">related_works_ids</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span> <span class="c1"># Flatten the lists</span></code></pre></figure>

<p><br />
<br />
Now we can use the <code class="language-plaintext highlighter-rouge">get_works()</code> function in the way that allowed us to retrieve the metadata and PDF files of the 3 articles in the first place. Note the additional (optional) arguments that we pass to the <code class="language-plaintext highlighter-rouge">get_works()</code> function, as before. Specifically, we pass values for the <code class="language-plaintext highlighter-rouge">persist_dir</code> and <code class="language-plaintext highlighter-rouge">pdf_output_dir</code> arguments, which will determine if and where we save the metadata for each article and PDF files to disk. This will save us time in the future if we want to access the metadata for the works again.</p>

<p>We also specify a field called <code class="language-plaintext highlighter-rouge">entry_type</code>, which indicates the type of entry we are retrieving. This field will be usefull later when we want to get information about how we retrieved the metadata for each work in the first place. This time, it is not necessary to store the output of the failed calls. Since we will pass in output from the <code class="language-plaintext highlighter-rouge">get_works()</code> function, all IDs used as input here must be valid IDs.</p>

<p>For now, we will retrieve the references and related works, and download PDFs with the Selenium WebDriver disabled. This can be done in Colab.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">references</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">references_ids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">entry_type</span><span class="o">=</span><span class="s">"reference of primary entry"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">related_works</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">related_works_ids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">entry_type</span><span class="o">=</span><span class="s">"related to primary entry"</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"References retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">references</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">references</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Related works retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">related_works</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">related_works</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 233/233 [05:31&lt;00:00,  1.42s/it]
Retrieving works: 100%|██████████| 30/30 [00:37&lt;00:00,  1.26s/it]
References retrieved: 233
PDF files downloaded: 74
Related works retrieved: 30
PDF files downloaded: 12
</code></pre></div></div>
<p><br />
<br />
As described above, we can save the metedata to disk. In addition, we can set <code class="language-plaintext highlighter-rouge">enable_selenium=True</code> and <code class="language-plaintext highlighter-rouge">is_headless=False</code> to enable the Selenium WebDriver with Chrome in standard mode, which will allow us to retrieve more PDF files. This additional functionality requires the function to be run in an environment with the <a href="https://www.google.com/chrome/">Google Chrome Browser</a> installed (e.g, in a virtual machine or on your local computer). Therefore, it will not work in the Colab environment. Also note that PDF files of articles which are not open access are not downloaded.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">references</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">references_ids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">entry_type</span><span class="o">=</span><span class="s">"reference of primary entry"</span><span class="p">,</span> <span class="n">enable_selenium</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">is_headless</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">related_works</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_works</span><span class="p">(</span><span class="n">related_works_ids</span><span class="p">,</span> <span class="n">email</span><span class="o">=</span><span class="n">EMAIL</span><span class="p">,</span> <span class="n">pdf_output_dir</span><span class="o">=</span><span class="s">"./pdfs"</span><span class="p">,</span> <span class="n">persist_dir</span><span class="o">=</span><span class="s">"./cache"</span><span class="p">,</span> <span class="n">entry_type</span><span class="o">=</span><span class="s">"related to primary entry"</span><span class="p">,</span> <span class="n">enable_selenium</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">is_headless</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"References retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">references</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">references</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Related works retrieved: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">related_works</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">PDF files downloaded: </span><span class="si">{</span><span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">related_works</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieving works: 100%|██████████| 233/233 [11:49&lt;00:00,  3.05s/it]
Retrieving works: 100%|██████████| 30/30 [00:29&lt;00:00,  1.03it/s]
References retrieved: 233
PDF files downloaded: 147
Related works retrieved: 30
PDF files downloaded: 12
</code></pre></div></div>
<p><br />
<br />
Finally, we can print the total number of works retrieved, which of them are open access, and the total number of PDF files downloaded.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">total_works</span> <span class="o">=</span> <span class="n">works</span> <span class="o">+</span> <span class="n">citations</span> <span class="o">+</span> <span class="n">references</span> <span class="o">+</span> <span class="n">related_works</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Total number of works retrieved:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">total_works</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Total number of open access works:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">total_works</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'open_access'</span><span class="p">][</span><span class="s">'is_oa'</span><span class="p">]]))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Total number of PDF files downloaded:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">([</span><span class="n">work</span> <span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">total_works</span> <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">]))</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total number of works retrieved: 488
Total number of open access works: 386
Total number of PDF files downloaded: 312
</code></pre></div></div>
<p><br />
<br />
To access the status messages of works where a PDF file could not be retrieved, we can use the following code snippet:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">work</span> <span class="ow">in</span> <span class="n">total_works</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">work</span><span class="p">[</span><span class="s">'pdf_path'</span><span class="p">]</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Title: </span><span class="si">{</span><span class="n">work</span><span class="p">[</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'title'</span><span class="p">][</span><span class="si">:</span><span class="mi">80</span><span class="p">]</span><span class="si">}</span><span class="s">...</span><span class="se">\n</span><span class="s">Status messages: </span><span class="si">{</span><span class="n">work</span><span class="p">[</span><span class="s">'status_messages'</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">DOI: </span><span class="si">{</span><span class="n">work</span><span class="p">[</span><span class="s">'metadata'</span><span class="p">][</span><span class="s">'ids'</span><span class="p">].</span><span class="n">get</span><span class="p">(</span><span class="s">'doi'</span><span class="p">,</span> <span class="s">'None'</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span></code></pre></figure>

<p><strong><em>Output:</em></strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Title: Autoimmune pathways in mice and humans are blocked by pharmacological stabilizat...
Status messages: 2024-08-14: PDF download from https://stm.sciencemag.org/content/scitransmed/11/502/eaaw1736.full.pdf using Selenium with headless mode set to False failed. 
DOI: https://doi.org/10.1126/scitranslmed.aaw1736

Title: Mendelian susceptibility to mycobacterial disease: recent discoveries...
Status messages: 2024-08-14: PDF URL not found in API call response. Skipped PDF download. 
DOI: https://doi.org/10.1007/s00439-020-02120-y

...

Title: Workflow Analysis using Graph Kernels....
Status messages: 2024-08-14: Successfully retrieved metadata with UID W2182707996. 2024-08-14: Work with UID https://openalex.org/W2182707996 is not open access or 'best_oa_location' key not found. Skipped PDF download. 
DOI: None

Title: Automating Radiologist Workflow, Part 2: Hands-Free Navigation...
Status messages: 2024-08-14: Successfully retrieved metadata with UID W2029380707. 2024-08-14: Work with UID https://openalex.org/W2029380707 is not open access or 'best_oa_location' key not found. Skipped PDF download. 
DOI: https://doi.org/10.1016/j.jacr.2008.05.012
</code></pre></div></div>
<p><br />
<br />
Be sure to check out the accompanying <a href="https://github.com/nicomarr/public-tutorials">Jupyter notebook</a>, which also includes a bonus feature to visualize open access statistics for the retrieved works.
<br />
<br />
<br /></p>

<h3 id="wrapping-up">Wrapping up</h3>

<p>In this tutorial, we explored using Python utility functions to interact with the OpenAlex API for retrieving full-text articles and leveraging citation data. Key points include:</p>

<ol>
  <li>Retrieving metadata and downloading PDF files using OpenAlex IDs, PMIDs, or DOIs.</li>
  <li>Obtaining citations, references, and related works for articles.</li>
  <li>Persisting metadata and automating PDF downloads with Selenium WebDriver.</li>
</ol>

<p>We demonstrated the efficiency of this approach by automating the download of 312 PDF files out of 386 open access works, from a total of 488 works retrieved. Key takeaways:</p>

<ul>
  <li>Subscriptions are needed for non-open access content.</li>
  <li>Use <a href="https://unpaywall.org/products/extension">Unpaywall</a> for open access versions of articles not automatically downloaded.</li>
  <li>Check the status_messages field for information on unretreived full-text content.</li>
  <li>Google Colab users should download data before closing sessions.</li>
  <li>PDF files are named using the convention: {PMID}{DOI}{OpenAlex ID}.pdf.</li>
</ul>

<p>These utility functions provide a foundation for automating full-text article retrieval and metadata collection. Future tutorials will explore text analysis, information extraction, and integration with language models.</p>

<p><br />
<br />
If you encounter any bugs in the code, have suggestions for improvements, or would like to request new features, please submit an issue at <a href="https://github.com/nicomarr/public-tutorials">my GitHub repo</a>. Your feedback is valuable for improving these tools for the research community.</p>]]></content><author><name></name></author><category term="tutorials" /><summary type="html"><![CDATA[Welcome to the first in a series of short tutorials aimed at making LLM-powered applications more accessible for health and life sciences researchers. This tutorial introduces Python utility functions for interacting with the OpenAlex API, a comprehensive, open-access catalog of global research named after the ancient Library of Alexandria and made by the nonprofit OurResearch.]]></summary></entry></feed>