docling/CHANGELOG.md at 1a2a9e4eff9be804dfaed25fb4796afe66be6f60

alihan/docling

mirror of https://github.com/DS4SD/docling.git synced 2025-03-21 10:19:12 +03:00

github-actions[bot] 1a2a9e4eff chore: bump version to 2.27.0 [skip ci]

2025-03-18 13:37:45 +00:00

v2.27.0 - 2025-03-18

Feature

Add factory for ocr engines via plugins (#1010) (6eaae3c)
Add DoclingParseV4 backend, using high-level docling-parse API (#905) (3960b19)
actor: Docling Actor on Apify infrastructure (#875) (772487f)
Equations to latex in MSWord backend (with inline groups) (#1114) (6eb718f)

Fix

html: Handle nested empty lists (#1154) (f94da44)
Use first table row as col headers (#1156) (0945973)
Pass tests, update docling-core to 2.22.0 (#1150) (aa92a57)

Documentation

Fix spelling of picture in usage (#1165) (7e01798)

v2.26.0 - 2025-03-11

Feature

Use new TableFormer model weights and default to accurate model version (#1100) (eb97357)

Fix

CLI: Fix help message for abort options (#1130) (4d64c4c)

Documentation

Add description of DOCLING_ARTIFACTS_PATH env var (#1124) (e1c49ad)

Performance

New revision code formula model and document picture classifier (#1140) (5e30381)

v2.25.2 - 2025-03-05

Fix

Proper handling of orphan IDs in layout postprocessing (#1118) (c56ab3a)

Documentation

Enrichment models (#1097) (357d41c)

v2.25.1 - 2025-03-03

Fix

Enable locks for threadsafe pdfium (#1052) (8dc0562)
html: Use 'start' attribute when parsing ordered lists from HTML docs (#1062) (de7b963)

Documentation

Improve docs on token limit warning triggered by HybridChunker (#1077) (db3ceef)

v2.25.0 - 2025-02-26

Feature

[Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054) (3c9fe76)
cli: Add option for downloading all models, refine help messages (#1061) (ab683e4)

Fix

Vlm using artifacts path (#1057) (e197225)
html: Parse text in div elements as TextItem (#1041) (1b0ead6)

Documentation

Extend chunking docs, add FAQ on token limit (#1053) (c84b973)

v2.24.0 - 2025-02-20

Feature

Implement new reading-order model (#916) (c93e369)

v2.23.1 - 2025-02-20

Fix

Runtime error when Pandas Series is not always of string type (#1024) (6796f0a)

Documentation

Revamp picture description example (#1015) (27c0400)

v2.23.0 - 2025-02-17

Feature

Support cuda:n GPU device allocation (#694) (77eb77b)
xml-jats: Parse XML JATS documents (#967) (428b656)

Fix

Revise DocTags, fix iterate_items to output content_layer in items (#965) (6e75f0b)

v2.22.0 - 2025-02-14

Feature

Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) (00d9405)
Introduce the enable_remote_services option to allow remote connections while processing (#941) (2716c7d)
Allow artifacts_path to be defined as ENV (#940) (5101e25)

Fix

Update Pillow constraints (#958) (af19c03)
Fix the initialization of the TesseractOcrModel (#935) (c47ae70)

Documentation

Update example Dockerfile with download CLI (#929) (7493d5b)
Examples for picture descriptions (#951) (2d66e99)

v2.21.0 - 2025-02-10

Feature

Add content_layer property to items to address body, furniture and other roles (#735) (cf78d5b)

v2.20.0 - 2025-02-07

Feature

Describe pictures using vision models (#259) (4cc6e3e)

Fix

Remove unused httpx (#919) (c18f47c)

v2.19.0 - 2025-02-07

Feature

New artifacts path and CLI utility (#876) (ed74fe2)

Fix

markdown: Handle nested lists (#910) (90b766e)
Test cases for RTL programmatic PDFs and fixes for the formula model (#903) (9114ada)
msword_backend: Handle conversion error in label parsing (#896) (722a6eb)
Enrichment models batch size and expose picture classifier (#878) (5ad6de0)

Documentation

Introduce example with custom models for RapidOCR (#874) (6d3fea0)

v2.18.0 - 2025-02-03

Feature

Expose equation exports (#869) (6a76b49)
Add option to define page range (#852) (70d68b6)
docx: Support of SDTs in docx backend (#853) (d727b04)
Python 3.13 support (#841) (4df085a)

Fix

markdown: Fix parsing if doc ending with table (#873) (5ac2887)
markdown: Add support for HTML content (#855) (94751a7)
docx: Merged table cells not properly converted (#857) (0cd81a8)
Processing of placeholder shapes in pptx that have text but no bbox (#868) (eff16b6)
KeyError in tableformer prediction (#854) (b1cf796)
Fixed docx import with headers that are also lists (#842) (2c037ae)
Use new add_code in html backend and add more typing hints (#850) (2a1f8af)
markdown: Fix empty block handling (#843) (bccb022)
Fix for the crash when encountering WMF images in pptx and docx (#837) (fea0a99)

Documentation

Updated the readme with upcoming features (#831) (d7c0828)
Add example for inspection of picture content (#624) (f9144f2)

v2.17.0 - 2025-01-28

Feature

CLI: Expose code and formula models in the CLI (#820) (6882e6c)
Add platform info to CLI version printout (#816) (95b293a)
ocr: Expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786) (5332755)
Introduce automatic language detection in TesseractOcrCliModel (#800) (3be2fb5)

Fix

Fix single newline handling in MD backend (#824) (5aed9f8)
Use file extension if filetype fails with PDF (#827) (adf6353)
Parse html with omitted body tag (#818) (a112d7a)

Documentation

Document Docling JSON parsing (#819) (6875913)
Add SSL verification error mitigation (#821) (5139b48)
backend XML: Do not delete temp file in notebook (#817) (4d41db3)
Typo (#814) (8a4ec77)
Added markdown headings to enable TOC in github pages (#808) (b885b2f)
Description of supported formats and backends (#788) (c2ae1cc)

v2.16.0 - 2025-01-24

Feature

New document picture classifier (#805) (16a218d)
Add Docling JSON ingestion (#783) (88a0e66)
Code and equation model for PDF and code blocks in markdown (#752) (3213b24)
Add "auto" language for TesseractOcr (#759) (8543c22)

Fix

Added extraction of byte-images in excel (#804) (a458e29)
Update docling-parse-v2 backend version with new parsing fixes (#769) (670a08b)

Documentation

Fix minor typos (#801) (c58f75d)
Add Azure RAG example (#675) (9020a93)
Fix links between docs pages (#697) (c49b352)
Fix correct Accelerator pipeline options in docs/examples/custom_convert.py (#733) (7686083)
Example to translate documents (#739) (f7e1cbf)

v2.15.1 - 2025-01-10

Fix

Improve OCR results, tighten criteria before dropping bitmap areas (#719) (5a060f2)
Allow earlier requests versions (#716) (e64b5a2)

Documentation

Add pointers to LangChain-side docs (#718) (9a6b5c8)
Add LangChain docs (#717) (4fa8028)

v2.15.0 - 2025-01-08

Feature

Added http header support for document converter and cli (#642) (0ee849e)

Fix

Correct scaling of debug visualizations, tune OCR (#700) (5cb4cf6)
Let BeautifulSoup detect the HTML encoding (#695) (42856fd)
mspowerpoint: Handle invalid images in PowerPoint slides (#650) (d49650c)

Documentation

Specify docstring types (#702) (ead396a)
Add link to rag with granite (#698) (6701f34)
Add integrations, revamp docs (#693) (2d24fae)
Add OpenContracts as an integration (#679) (569038d)
Add Weaviate RAG recipe notebook (#451) (2b591f9)
Document Haystack & Vectara support (#628) (fc645ea)

v2.14.0 - 2024-12-18

Feature

Create a backend to transform PubMed XML files to DoclingDocument (#557) (fd03480)

v2.13.0 - 2024-12-17

Feature

Updated Layout processing with forms and key-value areas (#530) (60dc852)
Create a backend to parse USPTO patents into DoclingDocument (#606) (4e08750)
Add Easyocr parameter recog_network (#613) (3b53bd3)

Documentation

Add Haystack RAG example (#615) (3e599c7)
Fix the path to the run_with_accelerator.py example (#608) (3bb3bf5)

v2.12.0 - 2024-12-13

Feature

Introduce support for GPU Accelerators (#593) (19fad92)

v2.11.0 - 2024-12-12

Feature

Add timeout limit to document parsing job. DS4SD#270 (#552) (3da166e)

Fix

Do not import python modules from deepsearch-glm (#569) (aee9c0b)
Handle no result from RapidOcr reader (#558) (f45499c)
Make enum serializable with human-readable value (#555) (a7df337)

Documentation

Update chunking usage docs, minor reorg (#550) (d0c9e8e)

v2.10.0 - 2024-12-09

Feature

Docling-parse v2 as default PDF backend (#549) (aca57f0)

Fix

Call into docling-core for legacy document transform (#551) (7972d47)
Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544) (78f61a8)

v2.9.0 - 2024-12-09

Feature

Expose new hybrid chunker, update docs (#384) (c8ecdd9)
MS Word backend: Make detection of headers and other styles localization agnostic (#534) (3e073df)

Fix

Correcting DefaultText ID for MS Word backend (#537) (eb7ffcd)
Add py.typed marker file (#531) (9102fe1)
Enable HTML export in CLI and add options for image mode (#513) (0d11e30)
Missing text in docx (t tag) when embedded in a table (#528) (b730b2d)
Restore pydantic version pin after fixes (#512) (c830b92)
Folder input in cli (#511) (8ada0bc)

Documentation

Document new integrations (#532) (e780333)

v2.8.3 - 2024-12-03

Fix

Improve handling of disallowed formats (#429) (34c7c79)

v2.8.2 - 2024-12-03

Fix

ParserError EOF inside string (#470) (#472) (c90c41c)
PermissionError when using tesseract_ocr_cli_model (#496) (d3f84b2)

Documentation

Add styling for faq (#502) (5ba3807)
Typo in faq (#484) (33cff98)
Add automatic api reference (#475) (d487210)
Introduce faq section (#468) (8ccb3c6)

Performance

Prevent temp file leftovers, reuse core type (#487) (051789d)

v2.8.1 - 2024-11-29

Fix

cli: Expose debug options (#467) (dd8de46)
Remove unused deps (#466) (af63818)

Documentation

Extend integration docs & README (#456) (84c46fd)

v2.8.0 - 2024-11-27

Feature

ocr: Added support for RapidOCR engine (#415) (85b2999)

Fix

Use correct image index in word backend (#442) (767563b)
Update tests and examples for docling-core 2.5.1 (#449) (29807a2)

v2.7.1 - 2024-11-26

Fix

Fixes for wordx (#432) (d0a1180)
Force pydantic < 2.10.0 (#407) (d7072b4)

Documentation

Add DocETL, Kotaemon, spaCy integrations; minor docs improvements (#408) (7a45b92)

v2.7.0 - 2024-11-20

Feature

Add support for ocrmac OCR engine on macOS (#276) (6efa96c)

Fix

Python3.9 support (#396) (7b013ab)
Propagate document limits to converter (#388) (32ebf55)

v2.6.0 - 2024-11-19

Feature

Added support for exporting DocItem to an image when page image is available (#379) (3f91e7d)
Expose ocr-lang in CLI (#375) (ed785ea)
Added excel backend (#334) (926dfd2)
Extracting picture data for raster images found in PPTX (#349) (7a97d71)

Fix

Fixing images in the input Word files (#330) (8533039)
Reduce logging by keeping option for more verbose (#323) (8b437ad)

Documentation

Fixed typo in v2 example v2 (#378) (911c3bd)
Add automatic generation of CLI reference (#325) (ca8524e)
Add architecture outline (#341) (25fd149)
Fix parameter in usage.md (#332) (835e077)

v2.5.2 - 2024-11-13

Fix

Skip glm model downloads (#322) (c9341bf)

v2.5.1 - 2024-11-12

Fix

Handling of single-cell tables in DOCX backend (#314) (fb8ba86)

Documentation

Hybrid RAG with Qdrant (#312) (7f5d35e)
Add Data Prep Kit integration (#316) (93fc1be)

v2.5.0 - 2024-11-12

Feature

OCR: Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) (c6b3763)

Fix

Configure env prefix for docling settings (#315) (5d4a10b)
Added handling of grouped elements in pptx backend (#307) (81c8243)
Allow mps usage for easyocr (#286) (97f214e)

Documentation

Add navigation indices (#305) (1239ade)

v2.4.2 - 2024-11-08

Fix

EasyOcrModel: Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282) (0eb065e)

v2.4.1 - 2024-11-08

Fix

tesserocr: Raise Exception if tesserocr has not loaded any languages (#279) (704d792)
Dockerfile example copy command (#234) (90836db)

Documentation

Update badges & credits (#248) (a84ec27)
Add coming-soon section (#235) (5ce02c5)
Add artifacts-path param to CLI (#233) (d5e65ae)

v2.4.0 - 2024-11-04

Feature

Pdf backend, table mode as options and artifacts path (#203) (40ad987)

Documentation

Add explicit artifacts path example (#224) (eeee3b4)
Update custom convert and dockerfile (#226) (5f5fea9)
Correct spelling of 'individual' (#219) (41acaa9)
Update LlamaIndex docs (#196) (244ca69)

v2.3.1 - 2024-10-30

Fix

Simplify torch dependencies and update pinned docling deps (#190) (eb679cc)
Allow to explicitly initialize the pipeline (#189) (904d24d)

v2.3.0 - 2024-10-30

Feature

Add pipeline timings and toggle visualization, establish debug settings (#183) (2a2c65b)

Fix

Fix duplicate title and heading + add e2e tests for html and docx (#186) (f542460)

v2.2.1 - 2024-10-28

Fix

Fix header levels for DOCX & HTML (#184) (b9f5c74)
Handling of long sequence of unescaped underscore chars in markdown (#173) (94d0729)
HTML backend, fixes for Lists and nested texts (#180) (7d19418)
MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178) (88c1673)

Documentation

Update LlamaIndex docs for Docling v2 (#182) (2cece27)
Fix batch convert (#177) (189d3c2)
Add export with embedded images (#175) (8d356aa)

v2.2.0 - 2024-10-23

Feature

Update to docling-parse v2 without history (#170) (4116819)
Support AsciiDoc and Markdown input format (#168) (3023f18)

Fix

Set valid=false for invalid backends (#171) (3496b48)

v2.1.0 - 2024-10-18

Feature

Add coverage_threshold to skip OCR for small images (#161) (b346faf)

Fix

Fix legacy doc ref (#162) (63bef59)

Documentation

Typo fix (#155) (f799e77)
Add graphical band in readme (#154) (034a411)
Add use docling (#150) (61c092f)

v2.0.0 - 2024-10-16

Feature

Docling v2 (#117) (7d3be0e)

Breaking

Docling v2 (#117) (7d3be0e)

Documentation

Introduce docs site (#141) (d504432)

v1.20.0 - 2024-10-11

Feature

New experimental docling-parse v2 backend (#131) (5e4944f)

v1.19.1 - 2024-10-11

Fix

Remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138) (dae2a3b)

Documentation

Simplify LlamaIndex example using Docling extension (#135) (5f1bd9e)

v1.19.0 - 2024-10-08

Feature

Add options for choosing OCR engines (#118) (f96ea86)

v1.18.0 - 2024-10-03

Feature

New torch-based docling models (#120) (2422f70)

v1.17.0 - 2024-10-03

Feature

Windows support (#122) (d44c62d)

v1.16.1 - 2024-09-27

Fix

Allow usage of opencv 4.6.x (#110) (34bd887)

Documentation

Document chunking (#111) (c05b692)

v1.16.0 - 2024-09-27

Feature

Support tableformer model choice (#90) (d6df76f)

v1.15.0 - 2024-09-24

Feature

Add figure in markdown (#98) (6a03c20)

v1.14.0 - 2024-09-24

Feature

Add URL support to CLI (#99) (3c46e42)

Fix

Fix OCR setting for pypdfium, minor refactor (#102) (d96b96c)

Documentation

Document CLI, minor README revamp (#100) (f8f2303)

v1.13.1 - 2024-09-23

Fix

Updated the render_as_doctags with the new arguments from docling-core (#93) (4794ce4)

v1.13.0 - 2024-09-18

Feature

Add table exports (#86) (f19bd43)

Fix

Bumped the glm version and adjusted the tests (#83) (442443a)

Documentation

Updated Docling logo.png with transparent background (#88) (0da7519)

v1.12.2 - 2024-09-17

Fix

tests: Adjust the test data to match the new version of LayoutPredictor (#82) (fa9699f)

v1.12.1 - 2024-09-16

Fix

CLI compatibility with python 3.10 and 3.11 (#79) (2870fdc)

v1.12.0 - 2024-09-13

Feature

Add docling cli (#75) (9899078)

Documentation

Showcase RAG with LlamaIndex and LangChain (#71) (53569a1)

v1.11.0 - 2024-09-10

Feature

Adding txt and doctags output (#68) (bdfdfbf)

v1.10.0 - 2024-09-10

Feature

Linux arm64 support and reducing dependencies (#69) (27a7a15)

v1.9.0 - 2024-09-03

Feature

Export document pages as multimodal output (#54) (1de2e4f)

Documentation

Update MAINTAINERS.md (#59) (69e5d95)
Mention quackling on README (#58) (85b7348)

v1.8.5 - 2024-08-30

Fix

Add unit tests (#51) (48f4d1b)

v1.8.4 - 2024-08-30

Fix

Propagate row_section in tables (#57) (de85e46)

Documentation

Add instructions for cpu-only installation (#56) (a8a60d5)

v1.8.3 - 2024-08-28

Fix

Table cells overlap and model warnings (#53) (f49ee82)

v1.8.2 - 2024-08-27

Fix

Refine conversion result (#52) (e46a66a)

Documentation

Update interface in README (#50) (fe817b1)

v1.8.1 - 2024-08-26

Fix

Align output formats (#49) (8cc147b)

v1.8.0 - 2024-08-23

Feature

Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status (#47) (a294b7e)

v1.7.1 - 2024-08-23

Fix

Better raise exception when a page fails to parse (#46) (8808463)
Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45) (7e84533)

v1.7.0 - 2024-08-22

Feature

Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44) (a8c6b29)

v1.6.3 - 2024-08-22

Fix

Usage of bytesio with docling-parse (#43) (fac5745)

v1.6.2 - 2024-08-22

Fix

Remove [ocr] extra to fix wheel install (#42) (6995268)

v1.6.1 - 2024-08-21

Fix

Add scipy as dependency (#40) (f19871a)

v1.6.0 - 2024-08-20

Feature

Add adaptive OCR, factor out treatment of OCR areas and cell filtering (#38) (e94d317)

v1.5.0 - 2024-08-20

Feature

Allow computing page images on-demand with scale and cache them (#36) (78347bf)

Documentation

Add technical paper ref (#37) (a13114b)

v1.4.0 - 2024-08-14

Feature

Update parser with bytesio interface and set as new default backend (#32) (90dd676)

Fix

Allow newer torch versions (#34) (349b0e9)

v1.3.0 - 2024-08-12

Feature

Output page images and extracted bbox (#31) (63d80ed)

v1.2.1 - 2024-08-07

Fix

Update (vuln) deps (#29) (79ef8d2)
Type of path_or_stream in PdfDocumentBackend (#28) (794b20a)

Documentation

Improve examples (#27) (9550db8)

v1.2.0 - 2024-08-07

Feature

Introducing docling_backend (#26) (b8f5e38)

v1.1.2 - 2024-07-31

Fix

Set page number using 1-based indexing (#22) (d2d9543)

v1.1.1 - 2024-07-30

Fix

Correct text extraction for table cells (#21) (f4bf3d2)

v1.1.0 - 2024-07-26

Feature

Add simplified single-doc conversion (#20) (d603137)

v1.0.2 - 2024-07-24

Fix

Add easyocr to main deps for valid extra (#19) (54b3dda)

v1.0.1 - 2024-07-24

Fix

Expose ocr as extra (#18) (b0725e0)

v1.0.0 - 2024-07-18

Feature

v1.0.0 release (#16) (71c3a9c)

Breaking

v1.0.0 release (#16) (71c3a9c)

v0.4.0 - 2024-07-17

Feature

Optimize table extraction quality, add configuration options (#11) (e9526bb)

v0.3.1 - 2024-07-17

Fix

Missing type for default values (#12) (d1d1724)

Documentation

Reflect supported Python versions, add badges (#10) (2baa35c)

v0.3.0 - 2024-07-17

Feature

Enable python 3.12 support by updating glm (#8) (fb72688)

Documentation

Add setup with pypi to Readme (#7) (2803222)

v0.2.0 - 2024-07-16

Feature

Build with ci (#6) (b1479cf)
