Files
nv-ingest/client/client_examples/examples/python_client_usage.ipynb

703 lines
22 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "056ce677",
"metadata": {},
"source": [
"# NV-Ingest: Python Client Quick Start Guide\n",
"\n",
"This notebook provides a quick start guide to using the NV-Ingest Python API to create a client that interacts with a running NV-Ingest cluster. It will walk through the following:\n",
"\n",
"- Instantiate a client object\n",
"- Define the task configuration for an NV-Ingest job\n",
"- Submit a job the the NV-Ingest cluster\n",
"- Retrieve completed results\n",
"- Investigate the multimodal extractions\n"
]
},
{
"cell_type": "markdown",
"id": "c14fe242",
"metadata": {},
"source": [
"Specify a few parameters to connect to our nv-ingest cluster and a notional document to guide the examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7727953",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# client config\n",
"HTTP_HOST = os.environ.get('HTTP_HOST', \"localhost\")\n",
"HTTP_PORT = os.environ.get('HTTP_PORT', \"7670\")\n",
"TASK_QUEUE = os.environ.get('TASK_QUEUE', \"morpheus_task_queue\")\n",
"\n",
"# minio config\n",
"MINIO_ACCESS_KEY = os.environ.get('MINIO_ACCESS_KEY', \"minioadmin\")\n",
"MINIO_SECRET_KEY = os.environ.get('MINIO_SECRET_KEY', \"minioadmin\")\n",
"\n",
"# time to wait for job to complete\n",
"DEFAULT_JOB_TIMEOUT = 90\n",
"\n",
"# sample input file and output directory\n",
"SAMPLE_PDF = \"/workspace/nv-ingest/data/multimodal_test.pdf\""
]
},
{
"cell_type": "markdown",
"id": "61db5566",
"metadata": {},
"source": [
"## The NV-Ingest Python Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ccddd853",
"metadata": {},
"outputs": [],
"source": [
"from base64 import b64decode\n",
"import time\n",
"\n",
"from nv_ingest_client.client import NvIngestClient\n",
"from nv_ingest_client.message_clients.rest.rest_client import RestClient\n",
"from nv_ingest_client.primitives import JobSpec\n",
"from nv_ingest_client.primitives.tasks import DedupTask\n",
"from nv_ingest_client.primitives.tasks import EmbedTask\n",
"from nv_ingest_client.primitives.tasks import ExtractTask\n",
"from nv_ingest_client.primitives.tasks import FilterTask\n",
"from nv_ingest_client.primitives.tasks import SplitTask\n",
"from nv_ingest_client.primitives.tasks import StoreTask, StoreEmbedTask\n",
"from nv_ingest_client.primitives.tasks import VdbUploadTask\n",
"from nv_ingest_client.util.file_processing.extract import extract_file_content\n",
"\n",
"from IPython import display"
]
},
{
"cell_type": "markdown",
"id": "6bc54af4",
"metadata": {},
"source": [
"Load a sample PDF to demonstrate NV-Ingest usage."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16298e5f",
"metadata": {},
"outputs": [],
"source": [
"file_content, file_type = extract_file_content(SAMPLE_PDF)"
]
},
{
"cell_type": "markdown",
"id": "61306c09",
"metadata": {},
"source": [
"Initialize a client that will submit jobs to our NV-Ingest cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d31e44c",
"metadata": {},
"outputs": [],
"source": [
"client = NvIngestClient(\n",
" message_client_allocator=RestClient,\n",
" message_client_hostname=HTTP_HOST,\n",
" message_client_port=HTTP_PORT,\n",
" message_client_kwargs=None,\n",
" msg_counter_id=\"nv-ingest-message-id\",\n",
" worker_pool_size=1,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "92211ced",
"metadata": {},
"source": [
"A `JobSpec` is a specification for creating a job for submission to the NV-Ingest microservice. It accepts the following parameters:\n",
"\n",
"- `document_type` : The file type of the file to be ingested.\n",
"- `payload` : A base64 encoded string of the file to be ingested.\n",
"- `source_id` : An identifier that maps to the file, our example uses the filename here.\n",
"- `source_name` : The name of the source for this ingest job, again, we use the filename.\n",
"- `extented_options` : A dictionary of additional options to pass in to the ingest job, we pass in informations allowing us to conduct performance tracing of the job.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d00dd717",
"metadata": {},
"outputs": [],
"source": [
"job_spec = JobSpec(\n",
" document_type=file_type,\n",
" payload=file_content,\n",
" source_id=SAMPLE_PDF,\n",
" source_name=SAMPLE_PDF,\n",
" extended_options={\n",
" \"tracing_options\": {\n",
" \"trace\": True,\n",
" \"ts_send\": time.time_ns(),\n",
" }\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "6a0a87cb",
"metadata": {},
"source": [
"Each ingest job will include a set of tasks. These tasks will define the operations that will be performed during ingestion. This allows each job to potentially have different ingestion instructions. Here we define a simple extract oriented job, but the full list of supported options are contained below:\n",
"\n",
"- `ExtractTask` : Performs multimodal extractions from a document, including text, images, and tables.\n",
"- `TableExtractTask`: Extracts the content from tables.\n",
"- `ChartExtractTask`: Extracts the content from charts.\n",
"- `SplitTask` : Chunk the text into smaller chunks, useful for storing in a vector database for retrieval applications.\n",
"- `Dedup` : Identifies duplicate images in document that can be filtered to remove data redundancy.\n",
"- `Filter` : Filters out images that are likely not useful using some heuristics, including size and aspect ratio.\n",
"- `EmbedTask` : Pass the text or table extractions through `\"nvidia/nv-embedqa-e5-v5` NIM to obtain its embeddings.\n",
"- `Store` : Save the extracted tables or images to an S3 compliant object store like MinIO.\n",
"- `Upload` : Save embeddings, chunks, and metadata to a Milvus vector database.\n",
"\n",
"After defining the ingestion tasks, they must be added to the job specification."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "061e5b4d",
"metadata": {},
"outputs": [],
"source": [
"extract_task = ExtractTask(\n",
" document_type=file_type,\n",
" extract_text=True,\n",
" extract_images=True,\n",
" extract_tables=True,\n",
" text_depth=\"document\",\n",
" extract_tables_method=\"yolox\",\n",
")\n",
"\n",
"dedup_task = DedupTask(\n",
" content_type=\"image\",\n",
" filter=True,\n",
")\n",
"\n",
"job_spec.add_task(extract_task)\n",
"job_spec.add_task(dedup_task)"
]
},
{
"cell_type": "markdown",
"id": "2da2f0ff",
"metadata": {},
"source": [
"A job identifier is created for the job specification. This is used to retrieve the results upon completion.\n",
"\n",
"With the `job_id`, the job is submitted to the NV-Ingest microservice. When the job is complete, the results are fetched.\n",
"\n",
"Note, many jobs can be submitted and asynchronous execution is supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9ad0589",
"metadata": {},
"outputs": [],
"source": [
"job_id = client.add_job(job_spec)\n",
"client.submit_job(job_id, TASK_QUEUE)\n",
"generated_metadata = client.fetch_job_result(\n",
" job_id, timeout=DEFAULT_JOB_TIMEOUT\n",
")[0]"
]
},
{
"cell_type": "markdown",
"id": "a833d21b",
"metadata": {},
"source": [
"## Explore the Outputs"
]
},
{
"cell_type": "markdown",
"id": "c5078539",
"metadata": {},
"source": [
"Let's explore elements of the NV-Ingest output. When data flows through an NV-Ingest pipeline, a number of extractions and transformations are performed. As the data is enriched, it is stored in rich metadata hierarchy. In the end, there will be a list of dictionaries, each of which represents a extracted type of information. The most common elements to extract from a dictionary in this hierarchy are the extracted content and the text representation of this content. The next few cells will demonstrate interacting with the metadata, pulling out these elements, and visualizing them. Note, when there is a `-1` value present, this represents non-applicable positional resolution. Positive numbers represent valid positional data.\n",
"\n",
"For a more complete description of metadata elements, view the data dictionary.\n",
"\n",
"[https://github.com/NVIDIA/nv-ingest/blob/main/docs/content-metadata.md](https://github.com/NVIDIA/nv-ingest/blob/main/docs/content-metadata.md)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94023458",
"metadata": {},
"outputs": [],
"source": [
"def redact_metadata_helper(metadata: dict) -> dict:\n",
" \"\"\"A simple helper function to redact `metadata[\"content\"]` so improve readability.\"\"\"\n",
" \n",
" text_metadata_redact = text_metadata.copy()\n",
" text_metadata_redact[\"content\"] = \"<---Redacted for readability--->\"\n",
" \n",
" return text_metadata_redact"
]
},
{
"cell_type": "markdown",
"id": "32838272",
"metadata": {},
"source": [
"### Explore Output - Text"
]
},
{
"cell_type": "markdown",
"id": "f99f8635",
"metadata": {},
"source": [
"This cell depicts the full metadata hierarchy for a text extraction with redacted content to ease readability. Notice the following sections are populated with information:\n",
"\n",
"- `content` - The raw extracted content, text in this case - this section will always be populated with a successful job.\n",
"- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.\n",
"- `source_metadata` - Describes the source document that is the basis of the ingest job.\n",
"- `text_metadata` - Contain information about the text extraction, including detected language, among others - this section will only exist when `metadata['content_metadata']['type'] == 'text'`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7c0dca1",
"metadata": {},
"outputs": [],
"source": [
"text_metadata = generated_metadata[3][\"metadata\"]\n",
"redact_metadata_helper(text_metadata)"
]
},
{
"cell_type": "markdown",
"id": "aa81e550",
"metadata": {},
"source": [
"View the text extracted from the sample document."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84c52623",
"metadata": {},
"outputs": [],
"source": [
"text_metadata[\"content\"]"
]
},
{
"cell_type": "markdown",
"id": "f0f2f01e",
"metadata": {},
"source": [
"### Explore Output - Tables"
]
},
{
"cell_type": "markdown",
"id": "9a180bca",
"metadata": {},
"source": [
"This cell depicts the full metadata hierarchy for a table extraction with redacted content to ease readability. Notice the following sections are populated with information:\n",
"\n",
"- `content` - The raw extracted content, a base64 encoded image of the extracted table in this case - this section will always be populated with a successful job.\n",
"- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.\n",
"- `source_metadata` - Describes the source and storage path of an extracted table in an S3 compliant object store.\n",
"- `table_metadata` - Contains the text representation of the table, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'structured'`.\n",
"\n",
"Note, `table_metadata` will store chart and table extractions. The are distringuished by `metadata['content_metadata']['subtype']`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10e44456",
"metadata": {},
"outputs": [],
"source": [
"table_metadata = generated_metadata[4][\"metadata\"]\n",
"redact_metadata_helper(table_metadata)"
]
},
{
"cell_type": "markdown",
"id": "aa3ec129",
"metadata": {},
"source": [
"Visualize the table contained within the extracted metadata."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e59c952",
"metadata": {},
"outputs": [],
"source": [
"display.Image(b64decode(table_metadata[\"content\"]))"
]
},
{
"cell_type": "markdown",
"id": "17eb3f8a",
"metadata": {},
"source": [
"View the corresponding text that maps to this table. This text could be embedded to support multimodal retrieval workflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1adb8706",
"metadata": {},
"outputs": [],
"source": [
"table_metadata[\"table_metadata\"][\"table_content\"]"
]
},
{
"cell_type": "markdown",
"id": "f36cbf7c",
"metadata": {},
"source": [
"### Explore Output - Charts"
]
},
{
"cell_type": "markdown",
"id": "f77f6680",
"metadata": {},
"source": [
"This cell depicts the full metadata hierarchy for a chart extraction with redacted content to ease readability. Notice the following sections are populated with information:\n",
"\n",
"- `content` - The raw extracted content, a base64 encoded image of the extracted chart in this case - this section will always be populated with a successful job.\n",
"- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.\n",
"- `source_metadata` - Describes the source and storage path of an extracted chart in an S3 compliant object store.\n",
"- `table_metadata` - Contains the text representation of the chart, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'structured'`.\n",
"\n",
"Note, `table_metadata` will store chart and table extractions. The are distringuished by `metadata['content_metadata']['subtype']`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc886cad",
"metadata": {},
"outputs": [],
"source": [
"chart_metadata = generated_metadata[7][\"metadata\"]\n",
"chart_metadata_redact = chart_metadata.copy()\n",
"chart_metadata_redact[\"content\"] = \"<---Redacted for readability--->\"\n",
"chart_metadata_redact"
]
},
{
"cell_type": "markdown",
"id": "7f07d604",
"metadata": {},
"source": [
"Visualize the chart contained within the extracted metadata."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40efa3f7",
"metadata": {},
"outputs": [],
"source": [
"display.Image(b64decode(chart_metadata[\"content\"]))"
]
},
{
"cell_type": "markdown",
"id": "a57372b0",
"metadata": {},
"source": [
"View the corresponding text that maps to this chart. This text could be embedded to support multimodal retrieval workflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fdb73ef9",
"metadata": {},
"outputs": [],
"source": [
"chart_metadata[\"table_metadata\"][\"table_content\"]"
]
},
{
"cell_type": "markdown",
"id": "8aa166f0",
"metadata": {},
"source": [
"### Explore Output - Images"
]
},
{
"cell_type": "markdown",
"id": "fa873466",
"metadata": {},
"source": [
"This cell depicts the full metadata hierarchy for a image extraction with redacted content to ease readability. Notice the following sections are populated with information:\n",
"\n",
"- `content` - The raw extracted content, a base64 encoded image extracted from the document in this case - this section will always be populated with a successful job.\n",
"- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.\n",
"- `source_metadata` - Describes the source and storage path of an extracted image in an S3 compliant object store.\n",
"- `image_metadata` - Contains the image type, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'image'`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d9e6ada",
"metadata": {},
"outputs": [],
"source": [
"img_metadata = generated_metadata[1][\"metadata\"]\n",
"redact_metadata_helper(img_metadata)"
]
},
{
"cell_type": "markdown",
"id": "c7658af1",
"metadata": {},
"source": [
"Visualize the image contained within the extracted metadata."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fd39c1f",
"metadata": {},
"outputs": [],
"source": [
"display.Image(b64decode(img_metadata[\"content\"]))"
]
},
{
"cell_type": "markdown",
"id": "6bf13b36-bc78-4f6c-81a3-5a0f901fdf09",
"metadata": {},
"source": [
"### Optional: Expanded Task Configuration"
]
},
{
"cell_type": "markdown",
"id": "d29ebd0d",
"metadata": {},
"source": [
"This section illustrates usage of the remaining task types used when supporting retrieval workflows.\n",
"\n",
"- `StoreTask` - Stores extracted content to an S3 compliant object store (MinIO by default) and updates the `source_metadata` with the corresponding stored location.\n",
"- `EmbedTask` - Computes an embedding for the extracted content using a [`nvidia/nv-embedqa-e5-v5`](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nv-embedqa-e5-v5) NVIDIA Inference Microservice (NIM) by default.\n",
"- `VdbUploadTask` - Inserts ingested content into a Milvus vector database to support retrieval use cases."
]
},
{
"cell_type": "markdown",
"id": "856422c5",
"metadata": {},
"source": [
"Define the ingest job specification."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97c8cd73",
"metadata": {},
"outputs": [],
"source": [
"job_spec = JobSpec(\n",
" document_type=file_type,\n",
" payload=file_content,\n",
" source_id=SAMPLE_PDF,\n",
" source_name=SAMPLE_PDF,\n",
" extended_options={\n",
" \"tracing_options\": {\n",
" \"trace\": True,\n",
" \"ts_send\": time.time_ns(),\n",
" }\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3e4bab16",
"metadata": {},
"source": [
"Here the task configuration is expanded, but requires the ancillary services (Embedding NIM, MinIO object stor, and Milvus Vector Database) to be up and running to return metadata back to the client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "052ef358",
"metadata": {},
"outputs": [],
"source": [
"extract_task = ExtractTask(\n",
" document_type=file_type,\n",
" extract_text=True,\n",
" extract_images=True,\n",
" extract_tables=True,\n",
" text_depth=\"document\",\n",
" extract_tables_method=\"yolox\",\n",
")\n",
"\n",
"dedup_task = DedupTask(\n",
" content_type=\"image\",\n",
" filter=True,\n",
")\n",
"\n",
"filter_task = FilterTask(\n",
" content_type=\"image\",\n",
" min_size=128,\n",
" max_aspect_ratio=5.0,\n",
" min_aspect_ratio=0.2,\n",
" filter=True,\n",
")\n",
"\n",
"split_task = SplitTask(\n",
" split_by=\"word\",\n",
" split_length=300,\n",
" split_overlap=10,\n",
" max_character_length=5000,\n",
" sentence_window_size=0,\n",
")\n",
"\n",
"store_task = StoreTask(\n",
" structured=True,\n",
" images=True,\n",
" store_method=\"minio\",\n",
" params={\n",
" \"access_key\": MINIO_ACCESS_KEY, \n",
" \"secret_key\": MINIO_SECRET_KEY,\n",
" }\n",
")\n",
"\n",
"embed_task = EmbedTask(\n",
" text=True,\n",
" tables=True,\n",
")\n",
"\n",
"store_embed_task = StoreEmbedTask(\n",
" params={\n",
" \"access_key\": MINIO_ACCESS_KEY, \n",
" \"secret_key\": MINIO_SECRET_KEY,\n",
" }\n",
"\n",
")\n",
"\n",
"vdb_upload_task = VdbUploadTask(\n",
" bulk_ingest=True,\n",
" bulk_ingest_path=\"embeddings/\",\n",
" params={\n",
" \"access_key\": MINIO_ACCESS_KEY, \n",
" \"secret_key\": MINIO_SECRET_KEY,\n",
" }\n",
")\n",
"\n",
"\n",
"job_spec.add_task(extract_task)\n",
"job_spec.add_task(dedup_task)\n",
"job_spec.add_task(filter_task)\n",
"job_spec.add_task(split_task)\n",
"job_spec.add_task(store_task)\n",
"job_spec.add_task(embed_task)\n",
"job_spec.add_task(vdb_upload_task)"
]
},
{
"cell_type": "markdown",
"id": "deca53ee",
"metadata": {},
"source": [
"A job identifier is created for the job specification. This is used to retrieve the results upon completion.\n",
"\n",
"With the `job_id`, the job is submitted to the NV-Ingest cluster. When the job is complete, the results are fetched."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d471133b",
"metadata": {},
"outputs": [],
"source": [
"job_id = client.add_job(job_spec)\n",
"client.submit_job(job_id, TASK_QUEUE)\n",
"generated_metadata = client.fetch_job_result(\n",
" job_id, timeout=DEFAULT_JOB_TIMEOUT\n",
")[0][0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3386304-e3a8-40a7-a09a-34a72059404f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}