Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at Astra Doc Ingest logic #2969

Closed

Conversation

erichare
Copy link

@erichare erichare commented May 3, 2024

@potter-potter i don't have write access to your branch, so opened up a PR for my changes here

@potter-potter
Copy link
Contributor

potter-potter commented May 3, 2024

@erichare Let's continue working on this from this branch.

Is this running successfully for you when you test it?

@erichare
Copy link
Author

erichare commented May 3, 2024

@potter-potter i had something come up earlier and haven't had a chance to fully test, i will get back to you on that soon. Also sorry for the very undescriptive branch name here haha, I forgot to change the default 😄

I'll follow up soon

@potter-potter
Copy link
Contributor

I don't think we can fully sort of copy delta-table. I was just trying to point you to something that read from a "database table"-like source. There isn't really a good example of a source connector that parallels AstraDB as far as I can tell.

@erichare
Copy link
Author

erichare commented May 3, 2024

Gotchya. I tried to pull out the logic that was specific (i.e., metadata fields, we dont really have a great notion of), but the basic idea of pulling the data, storing as CSV with pandas, and then using that as the input data seemed to make sense for Astra too. but with that said, yeah, let me go ahead and test it more before wasting your time looking at this further 😄 if you see anything obvious to adjust definitely let me know though

@potter-potter
Copy link
Contributor

It actually looks like this gets the entire table:
astra_docs = list(self.astra_db_collection.paginated_find())

In most of the connectors, since they are not table based, get_ingest_docs just gets a pointer to the docs and then we can potentially fan out downloaders to download the docs with get_files

That being said, this is not a "docs" based connector (its table based) so it is a little different. I will look through our source connectors again to see if there is a better connector to model.

@erichare
Copy link
Author

erichare commented May 3, 2024

It actually looks like this gets the entire table: astra_docs = list(self.astra_db_collection.paginated_find())

In most of the connectors, since they are not table based, get_ingest_docs just gets a pointer to the docs and then we can potentially fan out downloaders to download the docs with get_files

That being said, this is not a "docs" based connector (its table based) so it is a little different. I will look through our source connectors again to see if there is a better connector to model.

That is a really great point. Yep, it will pull the full table. paginated_find itself can take a filter condition, if there's any way to pass kwargs to get_ingest_docs?

@potter-potter
Copy link
Contributor

potter-potter commented May 10, 2024

@erichare I haven't had time to really look at other examples. (mainly because I don't think there is one that maps quite right to this one) I'm going to redo the destination PR so we can get it merged and then dive into this source (Ingest) logic later.

@MthwRobinson
Copy link
Contributor

@potter-potter - Do you know if this PR is still being actively worked?

@erichare
Copy link
Author

@MthwRobinson I got pulled into some other stuff, but would like to get this merged... with that said, we sort of had two simultaneous PRs going, and i think this particular PR is out of date. If @potter-potter agrees, i'll open a new PR with just the Astra DB Source connector logic and we can close this one. I have some updates to make to it anyway as a result of recent changes on the Astra end.

@MthwRobinson
Copy link
Contributor

Thanks @erichare !

@potter-potter
Copy link
Contributor

potter-potter commented May 31, 2024

@erichare Thanks. If you could close this and open a new one. We merged in the updates to the Astra destination (via a different PR), but this source connector needed more work.

@erichare
Copy link
Author

@erichare Thanks. If you could close this and open a new one. We merged in the updates to the Astra destination (via a different PR), but this source connector needed more work.

I'm on it @potter-potter . Thanks!

@erichare erichare closed this May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants