feat/Extract images in partition_html #3050

Open
jiarongkoh opened this issue May 19, 2024 · 1 comment
Labels
enhancement New feature or request needs follow up

Comments

@jiarongkoh

Is your feature request related to a problem? Please describe.
I process HTML files and use the partition_html function to do so. However, I noticed that this function can extract tables as elements, but not images.

Describe the solution you'd like
I would like partition_html to be able to extract images, the way shared.PartitionParameters can.

Describe alternatives you've considered
I have tried passing the same HTML file to shared.PartitionParameters, but this also does not extract images. One alternative I explored was converting the HTML file to PDF. While this might be possible, there is no guarantee that the conversion will yield the expected output.

Additional context
nil

@jiarongkoh jiarongkoh added the enhancement New feature or request label May 19, 2024
@MthwRobinson
Contributor

Hi @jiarongkoh - thanks for the issue! We haven't supported image extraction from HTML in the past because images in HTML are linked rather than embedded directly in the document. We'll revisit internally though and follow up.
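
As a possible interim workaround, and to illustrate the "linked rather than embedded" point above, the image references can be collected from the HTML separately. The sketch below uses only the Python standard library; `ImageLinkCollector`, `collect_image_links`, and the `base_url` parameter are hypothetical names for illustration and are not part of the unstructured API. It only gathers the links and alt text; it does not download or decode any image.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collect src/alt pairs from <img> tags (hypothetical helper, not an unstructured class)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # assumed URL of the page, used to resolve relative links
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if src:  # skip <img> tags that carry no src attribute
            self.images.append(
                {"src": urljoin(self.base_url, src), "alt": attr_map.get("alt") or ""}
            )


def collect_image_links(html_text, base_url=""):
    """Return a list of {"src", "alt"} dicts for every <img> with a src in html_text."""
    parser = ImageLinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.images
```

Calling `collect_image_links(html_text, base_url)` on the same HTML that is passed to partition_html gives a list of image links that could then be fetched or stored alongside the partitioned elements, if that fits the use case.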
