mirror of
https://github.com/NVIDIA/nv-ingest.git
synced 2025-01-05 18:58:13 +03:00
268 lines
7.9 KiB
Plaintext
268 lines
7.9 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5163ad8d-faf3-4754-9d48-38672bc20afd",
|
|
"metadata": {},
|
|
"source": [
|
|
"# NV-Ingest: CLI Client Quick Start Guide\n",
|
|
"\n",
|
|
"This notebook provides a quick start guide to using the NV-Ingest client to interact with a running NV-Ingest cluster. It will walk through the following:\n",
|
|
"\n",
|
|
"- Explore the CLI client help utility\n",
|
|
"- Submit a single file NV-Ingest job with the CLI client\n",
|
|
"- Submit a batch NV-Ingest job with the CLI client\n",
|
|
"- View NV-Ingest job outputs"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6e0a6279-e78f-412a-94e7-a64ce5c0e4df",
|
|
"metadata": {},
|
|
"source": [
|
|
"Specify a few notional files for testing and parameters to connect with a running NV-Ingest cluster."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "daf6a66e-1e8f-4d7c-ac70-91b2c79a9951",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"\n",
|
|
"# sample input file and output directories\n",
|
|
"SAMPLE_PDF0 = \"/workspace/nv-ingest/data/multimodal_test.pdf\"\n",
|
|
"os.environ[\"SAMPLE_PDF0\"] = SAMPLE_PDF0\n",
|
|
"SAMPLE_PDF1 = \"/workspace/nv-ingest/data/functional_validation.pdf\"\n",
|
|
"BATCH_FILE = \"/workspace/client_examples/examples/dataset.json\"\n",
|
|
"os.environ[\"BATCH_FILE\"] = BATCH_FILE\n",
|
|
"OUTPUT_DIRECTORY_SINGLE = \"/workspace/client_examples/examples/processed_docs_single\"\n",
|
|
"OUTPUT_DIRECTORY_BATCH = \"/workspace/client_examples/examples/processed_docs_batch\"\n",
|
|
"os.environ[\"OUTPUT_DIRECTORY_SINGLE\"] = OUTPUT_DIRECTORY_SINGLE\n",
|
|
"os.environ[\"OUTPUT_DIRECTORY_BATCH\"] = OUTPUT_DIRECTORY_BATCH"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c5d7f9b6-9942-4ec6-b11c-aa2fb1a3855d",
|
|
"metadata": {},
|
|
"source": [
|
|
"## The NV-Ingest CLI Client"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "69547089-554c-4643-a1b0-635655ac56d7",
|
|
"metadata": {},
|
|
"source": [
|
|
"This section will illustrate usage of the `nv-ingest-cli` client to submit ingest jobs to an up and running NV-Ingest cluster."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "71c88c9b-be77-4a65-b8c9-75241db1678c",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Help Utility"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "40d7d46b-0000-42a8-9e83-058f671feb90",
|
|
"metadata": {},
|
|
"source": [
|
|
"The CLI help utility will provide a description of settings and arguments that can be used to configure ingest jobs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "76a265f3-bcaa-4436-9338-9d6041c0d86e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!nv-ingest-cli --help"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c24c2e13-8f2b-484f-843b-d908f7ecea2e",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Submitting a Single File Job"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "715fed85-ccb1-4c32-afa3-d5de4e19a6c1",
|
|
"metadata": {},
|
|
"source": [
|
|
"This section will demonstrate a CLI example that submits a single file extraction oriented NV-Ingest job and save outputs locally."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "0a4080c2-a245-46ee-9a1b-9e8281f7e6f8",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!nv-ingest-cli \\\n",
|
|
" --doc ${SAMPLE_PDF0} \\\n",
|
|
" --task='extract:{\"document_type\": \"pdf\", \"extract_method\": \"pdfium\", \"extract_text\": true, \"extract_images\": true, \"extract_tables\": true, \"extract_tables_method\": \"yolox\"}' \\\n",
|
|
" --task='dedup:{\"content_type\": \"image\", \"filter\": true}' \\\n",
|
|
" --task='filter:{\"content_type\": \"image\", \"min_size\": 128, \"max_aspect_ratio\": 5.0, \"min_aspect_ratio\": 0.2, \"filter\": true}' \\\n",
|
|
" --client_host=${REDIS_HOST} \\\n",
|
|
" --client_port=${REDIS_PORT} \\\n",
|
|
" --output_directory=${OUTPUT_DIRECTORY_SINGLE}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e7cac54a-8dc5-44a5-be01-64bf3c9c930b",
|
|
"metadata": {},
|
|
"source": [
|
|
"The outputs will be saved locally for usage after job completion."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "3eeb6d26-3935-423f-91cd-aa6dea774133",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!tree processed_docs_single"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d5fbf0c3-1172-4967-979a-2660a932c1ea",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Submitting a Batch Job"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dd642383-a031-41fa-aca8-011b16837b0a",
|
|
"metadata": {},
|
|
"source": [
|
|
"Alternatively, a batch job can be submitted using on a json file that includes list documents to be ingested. This json file will need to include the following keys:\n",
|
|
"\n",
|
|
"- `sampled_files` - A list of paths to files for the ingest job.\n",
|
|
"- `metadata` - Requires a `file_type_proportions` key with a sub-array that can be empty (this requirement may be deprecated in future versions).\n",
|
|
"\n",
|
|
"All files included have the same ingest task configuration applied to them."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5dcab38f-98f4-494a-bff7-03348e46c420",
|
|
"metadata": {},
|
|
"source": [
|
|
"Create a notional json file to demonstrate usage of the CLI batch job configuration."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "85abcd01-b011-4eef-9a26-dbb34b35f91f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"\n",
|
|
"batch_files = {\"sampled_files\": [SAMPLE_PDF0, SAMPLE_PDF1], \"metadata\": {\"file_type_proportions\": {}}}\n",
|
|
"\n",
|
|
"with open(BATCH_FILE, \"w\") as f:\n",
|
|
" json.dump(batch_files, f, indent=2)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "28c47db5-0564-4ebc-9537-f7dfb88e97d5",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!cat $BATCH_FILE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0fa7f839-0382-4905-804c-485163ed696d",
|
|
"metadata": {},
|
|
"source": [
|
|
"The results of this job will be stored locally in the same file hirearchy. The names of each file will map to the file name in the dataset file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "5d2ee968-66ec-4d24-af53-5014e61b32ac",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!nv-ingest-cli \\\n",
|
|
" --dataset ${BATCH_FILE} \\\n",
|
|
" --task='extract:{\"document_type\": \"pdf\", \"extract_method\": \"pdfium\", \"extract_text\": true, \"extract_images\": true, \"extract_tables\": true, \"extract_tables_method\": \"yolox\"}' \\\n",
|
|
" --task='dedup:{\"content_type\": \"image\", \"filter\": true}' \\\n",
|
|
" --task='filter:{\"content_type\": \"image\", \"min_size\": 128, \"max_aspect_ratio\": 5.0, \"min_aspect_ratio\": 0.2, \"filter\": true}' \\\n",
|
|
" --client_host=${REDIS_HOST} \\\n",
|
|
" --client_port=${REDIS_PORT} \\\n",
|
|
" --output_directory=${OUTPUT_DIRECTORY_BATCH}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2c7e8bd9-9882-4acc-bd01-9dc53328e746",
|
|
"metadata": {},
|
|
"source": [
|
|
"The outputs will be saved locally for usage after job completion."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "33fb01e7-c08e-4505-960d-85fabff1d161",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!tree processed_docs_batch"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "aa5f0ec1-42c8-422d-a599-2e88618f958a",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.12"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|