Data ingestion just got a lot easier with Unstructured and Astra DB.
Unstructured is a platform that enables developers to convert any document, file type, or layout into LLM-ready data. It is a no-code cloud service that stands up GenAI data pipelines, covering transformation, cleaning, and generating embeddings for a vector database.
What is Unstructured with Astra DB?
Building Your Application with Unstructured and Astra DB
from unstructured.partition.pdf import partition_pdf
# Returns a list of Element objects found in the pages of the parsed PDF document
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf")
# Applies the English and Swedish language packs for OCR. OCR is only applied
# if the text is not available in the PDF.
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", languages=["eng", "swe"])
Web URL Scraping
Unstructured with Astra DB can also support web scraping: pulling data from HTML pages, parsing it into structured text, and generating embeddings for storage in Astra DB. For example, internal and external documentation pages can be parsed so that developers can build chatbots that let users query the documentation.
from unstructured.partition.html import partition_html
url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))
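Once a page is partitioned, each element still has to be turned into a record before it can be embedded and stored. The following is a minimal sketch of that step, assuming only that each element converts to its text with `str()` (as the `print` call above does) and may expose a `category` attribute. The `elements_to_records` helper and the record field names are hypothetical, and the embedding step and the write to Astra DB are left out.

```python
from typing import Any, Dict, Iterable, List


def elements_to_records(elements: Iterable[Any], source: str) -> List[Dict[str, Any]]:
    """Turn partitioned elements into plain dicts ready for embedding.

    Hypothetical helper: the field names are illustrative, not an Astra DB schema.
    """
    records = []
    for position, element in enumerate(elements):
        text = str(element).strip()
        if not text:
            # Skip elements that carry no text.
            continue
        records.append(
            {
                "text": text,
                # Fall back to a generic label if the element has no category.
                "category": getattr(element, "category", "Unknown"),
                "source": source,
                "position": position,
            }
        )
    return records
```

With the snippet above, `elements_to_records(elements, source=url)` would yield one record per non-empty element, keeping the original position so the page order can be reconstructed later.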
Building an Email Database
Unstructured includes the ability to handle email messages and perform much of the same processing as shown above. While search is often an effective way to retrieve information, using an LLM to access information within emails can power a wide variety of use cases. For example, developers can ask the LLM to retrieve all emails that include receipts, or that discuss a particular topic.
from unstructured.partition.email import partition_email
# Partition an email directly from a file path
elements = partition_email(filename="example-docs/fake-email.eml")
# Alternatively, partition from an open file object
with open("example-docs/fake-email.eml", "r") as f:
    elements = partition_email(file=f)
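To put a question about an email to an LLM, the partitioned text first has to be assembled into a prompt. Here is a minimal sketch of that step, assuming the elements convert to text with `str()`. The `build_email_prompt` helper, its `max_chars` budget, and the prompt wording are all hypothetical, and the LLM call itself is omitted.

```python
from typing import Any, Iterable


def build_email_prompt(question: str, elements: Iterable[Any], max_chars: int = 4000) -> str:
    """Assemble a prompt from a question and an email's partitioned elements."""
    # Join the non-empty element texts, one per line.
    body = "\n".join(str(el).strip() for el in elements if str(el).strip())
    # Truncate long emails so the prompt stays within a rough size budget.
    body = body[:max_chars]
    return f"Email:\n{body}\n\nQuestion: {question}\nAnswer:"
```

A call such as `build_email_prompt("Does this email contain a receipt?", elements)` returns a single string that could be sent to whichever LLM the application uses.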
Unstructured with Astra Data Loader for PDFs
Astra DB users can simplify the processing of PDFs using the new Astra Data Loader. The Data Loader supports multiple files and large file sizes, so users can now ingest PDFs directly through the Astra DB portal. It handles everything else, using Unstructured.io's capabilities to partition and chunk documents. If you’re also using Vectorize, embeddings are automatically generated with your preferred provider. No coding is necessary!
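Although the Data Loader needs no code, it can help to see what chunking does: it groups partitioned text into pieces small enough to embed. The sketch below illustrates the general idea with a simple character budget. It is not the Data Loader's or Unstructured's actual chunking strategy, and `chunk_texts` and `max_chars` are hypothetical names.

```python
from typing import List


def chunk_texts(texts: List[str], max_chars: int = 500) -> List[str]:
    """Greedily pack consecutive texts into chunks of at most max_chars.

    A single text longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for text in texts:
        text = text.strip()
        if not text:
            continue
        candidate = f"{current}\n{text}" if current else text
        if len(candidate) <= max_chars or not current:
            # Still within budget, or starting a fresh chunk.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks
```

Keeping consecutive texts together preserves local context within each chunk, which is the property that matters when the chunks are later embedded and retrieved.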
Unstructured with Langflow
Self-managed Langflow now offers flexible document ingestion with an Unstructured component. Upload a variety of file types, including PDFs, images, videos, Word documents, and PowerPoint presentations. The integration supports both the Unstructured serverless API and local Unstructured installations for simple document ingestion within your Langflow flows.
Leaders like Barracuda and Temporal manage Apache Cassandra® with Astra DB. You can too.
What is Unstructured?
What is Astra DB?
How does Unstructured work?
When is it best to use the Unstructured integration?
Is it free to use the Unstructured integration?
Do I need an Unstructured account to access this integration?
The open-source version of Unstructured can be installed with:
Integrate Unstructured with Langflow and Astra DB
Data ingestion just got a lot easier with Unstructured, Langflow, and Astra DB.