mirror of
https://github.com/DS4SD/docling.git
synced 2025-03-21 10:19:12 +03:00
v2.27.0 - 2025-03-18
Feature
- Add factory for ocr engines via plugins (#1010) (6eaae3c)
- Add DoclingParseV4 backend, using high-level docling-parse API (#905) (3960b19)
- actor: Docling Actor on Apify infrastructure (#875) (772487f)
- Equations to latex in MSWord backend (with inline groups) (#1114) (6eb718f)
Fix
- html: Handle nested empty lists (#1154) (f94da44)
- Use first table row as col headers (#1156) (0945973)
- Pass tests, update docling-core to 2.22.0 (#1150) (aa92a57)
Documentation
v2.26.0 - 2025-03-11
Feature
Fix
Documentation
Performance
v2.25.2 - 2025-03-05
Fix
Documentation
v2.25.1 - 2025-03-03
Fix
- Enable locks for threadsafe pdfium (#1052) (8dc0562)
- html: Use 'start' attribute when parsing ordered lists from HTML docs (#1062) (de7b963)
Documentation
v2.25.0 - 2025-02-26
Feature
- [Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054) (3c9fe76)
- cli: Add option for downloading all models, refine help messages (#1061) (ab683e4)
Fix
- Vlm using artifacts path (#1057) (e197225)
- html: Parse text in div elements as TextItem (#1041) (1b0ead6)
Documentation
v2.24.0 - 2025-02-20
Feature
v2.23.1 - 2025-02-20
Fix
Documentation
v2.23.0 - 2025-02-17
Feature
- Support cuda:n GPU device allocation (#694) (77eb77b)
- xml-jats: Parse XML JATS documents (#967) (428b656)
Fix
v2.22.0 - 2025-02-14
Feature
- Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) (00d9405)
- Introduce the enable_remote_services option to allow remote connections while processing (#941) (2716c7d)
- Allow artifacts_path to be defined as ENV (#940) (5101e25)
Fix
- Update Pillow constraints (#958) (af19c03)
- Fix the initialization of the TesseractOcrModel (#935) (c47ae70)
Documentation
- Update example Dockerfile with download CLI (#929) (7493d5b)
- Examples for picture descriptions (#951) (2d66e99)
v2.21.0 - 2025-02-10
Feature
v2.20.0 - 2025-02-07
Feature
Fix
v2.19.0 - 2025-02-07
Feature
Fix
- markdown: Handle nested lists (#910) (90b766e)
- Test cases for RTL programmatic PDFs and fixes for the formula model (#903) (9114ada)
- msword_backend: Handle conversion error in label parsing (#896) (722a6eb)
- Enrichment models batch size and expose picture classifier (#878) (5ad6de0)
Documentation
v2.18.0 - 2025-02-03
Feature
- Expose equation exports (#869) (6a76b49)
- Add option to define page range (#852) (70d68b6)
- docx: Support of SDTs in docx backend (#853) (d727b04)
- Python 3.13 support (#841) (4df085a)
Fix
- markdown: Fix parsing if doc ending with table (#873) (5ac2887)
- markdown: Add support for HTML content (#855) (94751a7)
- docx: Merged table cells not properly converted (#857) (0cd81a8)
- Processing of placeholder shapes in pptx that have text but no bbox (#868) (eff16b6)
- KeyError in tableformer prediction (#854) (b1cf796)
- Fixed docx import with headers that are also lists (#842) (2c037ae)
- Use new add_code in html backend and add more typing hints (#850) (2a1f8af)
- markdown: Fix empty block handling (#843) (bccb022)
- Fix for the crash when encountering WMF images in pptx and docx (#837) (fea0a99)
Documentation
- Updated the readme with upcoming features (#831) (d7c0828)
- Add example for inspection of picture content (#624) (f9144f2)
v2.17.0 - 2025-01-28
Feature
- CLI: Expose code and formula models in the CLI (#820) (6882e6c)
- Add platform info to CLI version printout (#816) (95b293a)
- ocr: Expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786) (5332755)
- Introduce automatic language detection in TesseractOcrCliModel (#800) (3be2fb5)
Fix
- Fix single newline handling in MD backend (#824) (5aed9f8)
- Use file extension if filetype fails with PDF (#827) (adf6353)
- Parse html with omitted body tag (#818) (a112d7a)
Documentation
- Document Docling JSON parsing (#819) (6875913)
- Add SSL verification error mitigation (#821) (5139b48)
- backend XML: Do not delete temp file in notebook (#817) (4d41db3)
- Typo (#814) (8a4ec77)
- Added markdown headings to enable TOC in github pages (#808) (b885b2f)
- Description of supported formats and backends (#788) (c2ae1cc)
v2.16.0 - 2025-01-24
Feature
- New document picture classifier (#805) (16a218d)
- Add Docling JSON ingestion (#783) (88a0e66)
- Code and equation model for PDF and code blocks in markdown (#752) (3213b24)
- Add "auto" language for TesseractOcr (#759) (8543c22)
Fix
- Added extraction of byte-images in excel (#804) (a458e29)
- Update docling-parse-v2 backend version with new parsing fixes (#769) (670a08b)
Documentation
- Fix minor typos (#801) (c58f75d)
- Add Azure RAG example (#675) (9020a93)
- Fix links between docs pages (#697) (c49b352)
- Fix correct Accelerator pipeline options in docs/examples/custom_convert.py (#733) (7686083)
- Example to translate documents (#739) (f7e1cbf)
v2.15.1 - 2025-01-10
Fix
- Improve OCR results, stricten criteria before dropping bitmap areas (#719) (5a060f2)
- Allow earlier requests versions (#716) (e64b5a2)
Documentation
v2.15.0 - 2025-01-08
Feature
Fix
- Correct scaling of debug visualizations, tune OCR (#700) (5cb4cf6)
- Let BeautifulSoup detect the HTML encoding (#695) (42856fd)
- mspowerpoint: Handle invalid images in PowerPoint slides (#650) (d49650c)
Documentation
- Specify docstring types (#702) (ead396a)
- Add link to rag with granite (#698) (6701f34)
- Add integrations, revamp docs (#693) (2d24fae)
- Add OpenContracts as an integration (#679) (569038d)
- Add Weaviate RAG recipe notebook (#451) (2b591f9)
- Document Haystack & Vectara support (#628) (fc645ea)
v2.14.0 - 2024-12-18
Feature
v2.13.0 - 2024-12-17
Feature
- Updated Layout processing with forms and key-value areas (#530) (60dc852)
- Create a backend to parse USPTO patents into DoclingDocument (#606) (4e08750)
- Add Easyocr parameter recog_network (#613) (3b53bd3)
Documentation
- Add Haystack RAG example (#615) (3e599c7)
- Fix the path to the run_with_accelerator.py example (#608) (3bb3bf5)
v2.12.0 - 2024-12-13
Feature
v2.11.0 - 2024-12-12
Feature
Fix
- Do not import python modules from deepsearch-glm (#569) (aee9c0b)
- Handle no result from RapidOcr reader (#558) (f45499c)
- Make enum serializable with human-readable value (#555) (a7df337)
Documentation
v2.10.0 - 2024-12-09
Feature
Fix
- Call into docling-core for legacy document transform (#551) (7972d47)
- Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544) (78f61a8)
v2.9.0 - 2024-12-09
Feature
- Expose new hybrid chunker, update docs (#384) (c8ecdd9)
- MS Word backend: Make detection of headers and other styles localization agnostic (#534) (3e073df)
Fix
- Correcting DefaultText ID for MS Word backend (#537) (eb7ffcd)
- Add py.typed marker file (#531) (9102fe1)
- Enable HTML export in CLI and add options for image mode (#513) (0d11e30)
- Missing text in docx (t tag) when embedded in a table (#528) (b730b2d)
- Restore pydantic version pin after fixes (#512) (c830b92)
- Folder input in cli (#511) (8ada0bc)
Documentation
v2.8.3 - 2024-12-03
Fix
v2.8.2 - 2024-12-03
Fix
- ParserError EOF inside string (#470) (#472) (c90c41c)
- PermissionError when using tesseract_ocr_cli_model (#496) (d3f84b2)
Documentation
- Add styling for faq (#502) (5ba3807)
- Typo in faq (#484) (33cff98)
- Add automatic api reference (#475) (d487210)
- Introduce faq section (#468) (8ccb3c6)
Performance
v2.8.1 - 2024-11-29
Fix
Documentation
v2.8.0 - 2024-11-27
Feature
Fix
- Use correct image index in word backend (#442) (767563b)
- Update tests and examples for docling-core 2.5.1 (#449) (29807a2)
v2.7.1 - 2024-11-26
Fix
Documentation
v2.7.0 - 2024-11-20
Feature
Fix
v2.6.0 - 2024-11-19
Feature
- Added support for exporting DocItem to an image when page image is available (#379) (3f91e7d)
- Expose ocr-lang in CLI (#375) (ed785ea)
- Added excel backend (#334) (926dfd2)
- Extracting picture data for raster images found in PPTX (#349) (7a97d71)
Fix
- Fixing images in the input Word files (#330) (8533039)
- Reduce logging by keeping option for more verbose (#323) (8b437ad)
Documentation
- Fixed typo in v2 example v2 (#378) (911c3bd)
- Add automatic generation of CLI reference (#325) (ca8524e)
- Add architecture outline (#341) (25fd149)
- Fix parameter in usage.md (#332) (835e077)
v2.5.2 - 2024-11-13
Fix
v2.5.1 - 2024-11-12
Fix
Documentation
v2.5.0 - 2024-11-12
Feature
- OCR: Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) (c6b3763)
Fix
- Configure env prefix for docling settings (#315) (5d4a10b)
- Added handling of grouped elements in pptx backend (#307) (81c8243)
- Allow mps usage for easyocr (#286) (97f214e)
Documentation
v2.4.2 - 2024-11-08
Fix
- EasyOcrModel: Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282) (0eb065e)
v2.4.1 - 2024-11-08
Fix
- tesserocr: Raise Exception if tesserocr has not loaded any languages (#279) (704d792)
- Dockerfile example copy command (#234) (90836db)
Documentation
- Update badges & credits (#248) (a84ec27)
- Add coming-soon section (#235) (5ce02c5)
- Add artifacts-path param to CLI (#233) (d5e65ae)
v2.4.0 - 2024-11-04
Feature
Documentation
- Add explicit artifacts path example (#224) (eeee3b4)
- Update custom convert and dockerfile (#226) (5f5fea9)
- Correct spelling of 'individual' (#219) (41acaa9)
- Update LlamaIndex docs (#196) (244ca69)
v2.3.1 - 2024-10-30
Fix
- Simplify torch dependencies and update pinned docling deps (#190) (eb679cc)
- Allow to explicitly initialize the pipeline (#189) (904d24d)
v2.3.0 - 2024-10-30
Feature
Fix
v2.2.1 - 2024-10-28
Fix
- Fix header levels for DOCX & HTML (#184) (b9f5c74)
- Handling of long sequence of unescaped underscore chars in markdown (#173) (94d0729)
- HTML backend, fixes for Lists and nested texts (#180) (7d19418)
- MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178) (88c1673)
Documentation
- Update LlamaIndex docs for Docling v2 (#182) (2cece27)
- Fix batch convert (#177) (189d3c2)
- Add export with embedded images (#175) (8d356aa)
v2.2.0 - 2024-10-23
Feature
- Update to docling-parse v2 without history (#170) (4116819)
- Support AsciiDoc and Markdown input format (#168) (3023f18)
Fix
v2.1.0 - 2024-10-18
Feature
Fix
Documentation
- Typo fix (#155) (f799e77)
- Add graphical band in readme (#154) (034a411)
- Add use docling (#150) (61c092f)
v2.0.0 - 2024-10-16
Feature
Breaking
Documentation
v1.20.0 - 2024-10-11
Feature
v1.19.1 - 2024-10-11
Fix
- Remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138) (dae2a3b)
Documentation
v1.19.0 - 2024-10-08
Feature
v1.18.0 - 2024-10-03
Feature
v1.17.0 - 2024-10-03
Feature
v1.16.1 - 2024-09-27
Fix
Documentation
v1.16.0 - 2024-09-27
Feature
v1.15.0 - 2024-09-24
Feature
v1.14.0 - 2024-09-24
Feature
Fix
Documentation
v1.13.1 - 2024-09-23
Fix
v1.13.0 - 2024-09-18
Feature
Fix
Documentation
v1.12.2 - 2024-09-17
Fix
v1.12.1 - 2024-09-16
Fix
v1.12.0 - 2024-09-13
Feature
Documentation
v1.11.0 - 2024-09-10
Feature
v1.10.0 - 2024-09-10
Feature
v1.9.0 - 2024-09-03
Feature
Documentation
v1.8.5 - 2024-08-30
Fix
v1.8.4 - 2024-08-30
Fix
Documentation
v1.8.3 - 2024-08-28
Fix
v1.8.2 - 2024-08-27
Fix
Documentation
v1.8.1 - 2024-08-26
Fix
v1.8.0 - 2024-08-23
Feature
v1.7.1 - 2024-08-23
Fix
- Better raise exception when a page fails to parse (#46) (8808463)
- Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45) (7e84533)