Challenge 3: Summarizing a large document using chaining


Introduction

The objective of this challenge is to try to get a summary of a complete paper. For the title it’s okay to just look at a part of the document, but generating a summary for the complete document requires an alternative approach, namely LLM chains.

Note
Although the expanding context windows of LLMs are gradually reducing the need for this technique, it remains relevant in specific use cases. In our case there are papers like this, with more than 10K pages and tens of millions of characters, extending well beyond the context windows of current models. Also keep in mind that in some cases chaining can still be more memory efficient (processing chunks individually instead of whole documents) and more flexible (integrating data from diverse information sources and tools within a single workflow). The optimal approach therefore depends on the specific requirements of the task and the available resources.

There are roughly three different approaches we can take. Stuffing is the most basic: the full content (possibly from multiple documents) is provided as the context in a single call. However, this only works with smaller documents due to context length limits.
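A minimal sketch of the stuffing approach. `call_llm` is a hypothetical stand-in for a real model call (e.g. to Vertex AI), stubbed here so the control flow can be run without API access.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to a model endpoint.
    return f"(model output for {len(prompt)} input characters)"

def stuff_summarize(documents: list[str]) -> str:
    # Stuffing: concatenate everything into one context, summarize in one call.
    context = "\n\n".join(documents)
    prompt = f"Summarize the following text:\n\n{context}"
    return call_llm(prompt)

docs = ["First section of the paper.", "Second section of the paper."]
print(stuff_summarize(docs))
```

Once the combined context exceeds the model's context window, this single call fails, which is what motivates the two chaining approaches below.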

The Map-Reduce chain is an alternative approach designed to handle large or multiple documents. In essence it makes multiple calls to an LLM over chunks of content (usually in parallel). It first applies an LLM to each document/chunk individually (the Map phase); the results (the LLM outputs) are then combined and sent to an LLM again to produce a single output (the Reduce phase). Typically, different prompts are used for the Map and Reduce phases.
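The Map and Reduce phases can be sketched as follows. Again `call_llm` is a hypothetical stub for a real model call; the parallelism of the Map phase is shown with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return f"(summary of {len(prompt)} chars)"

def map_reduce_summarize(chunks: list[str]) -> str:
    # Map phase: summarize each chunk independently, in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(
            lambda c: call_llm(f"Summarize this section:\n\n{c}"), chunks))
    # Reduce phase: combine the partial summaries with a different prompt.
    combined = "\n".join(partials)
    return call_llm(f"Combine these partial summaries into one:\n\n{combined}")

print(map_reduce_summarize(["Chunk one ...", "Chunk two ...", "Chunk three ..."]))
```

Because the Map calls are independent, this approach parallelizes well, at the cost of each chunk being summarized without knowledge of the others.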

The Refine chain also makes multiple calls to an LLM, but it does so iteratively. It starts with the first document/chunk, passes its content, and gets a response. It then moves to the second document/chunk, passing that content plus the response from the previous call, and iterates until the last document/chunk, where passing the last (rolling) response yields the final answer.
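The iteration above can be sketched like this, with `call_llm` again a hypothetical stub for a real model call:

```python
def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return f"(rolling summary, prompt was {len(prompt)} chars)"

def refine_summarize(chunks: list[str]) -> str:
    # Start with a plain summary of the first chunk ...
    summary = call_llm(f"Summarize the following text:\n\n{chunks[0]}")
    # ... then fold each subsequent chunk into the rolling summary.
    for chunk in chunks[1:]:
        summary = call_llm(
            "Here is an existing summary:\n"
            f"{summary}\n\n"
            "Refine it using the additional context below:\n"
            f"{chunk}"
        )
    return summary

print(refine_summarize(["Chunk one ...", "Chunk two ...", "Chunk three ..."]))
```

Unlike Map-Reduce, the calls here are strictly sequential, but each step sees the accumulated summary, which tends to preserve the document's overall narrative.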

Description

To generate the summaries, we'll implement the Refine approach for this challenge. Most of the code is already provided in the extract_summary_from_text method in the Cloud Function. Similar to the previous challenge, you're expected to design the prompt and provide the right values for the placeholders.
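As a starting point for the prompt design, a Refine chain typically needs two templates: one for the first chunk and one for each refinement step. The templates and placeholder names below (`text`, `existing_answer`) are purely illustrative and may differ from what the provided extract_summary_from_text code expects.

```python
# Hypothetical prompt templates for the Refine approach; adapt the
# placeholder names to match the ones used in the Cloud Function.
INITIAL_PROMPT = (
    "Write a concise summary of the following text:\n\n"
    "{text}\n\n"
    "CONCISE SUMMARY:"
)

REFINE_PROMPT = (
    "We have an existing summary up to this point:\n"
    "{existing_answer}\n\n"
    "Refine the existing summary (only if needed) using the context below:\n"
    "{text}\n\n"
    "REFINED SUMMARY:"
)

print(INITIAL_PROMPT.format(text="First chunk of the paper"))
```

The refine template deliberately asks the model to change the summary "only if needed", which helps keep the rolling summary stable as later chunks are folded in.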

Success Criteria

  • For this paper we expect a summary like this:

    The author argues that the standard cosmological model is incorrect and that there is no dark matter. The author provides several arguments for this, including:
    
    * The observed properties of galaxies are consistent with them being self-regulated, largely isolated structures that sometimes interact.
    * The observed uniformity of the galaxy population is evidence against the standard cosmological model.
    * The large observed ratio of star-forming galaxies over elliptical galaxies is evidence against the standard cosmological model.
    
    The author concludes that understanding galaxies purely as baryonic, self-gravitating systems becomes simple and predictive.
    

Note: By their nature, LLM results can vary; this is something to expect, so your exact text may not match the above, but the intent should be the same.

Learning Resources
