Batched reads for improved performance

Dear forum,

I am looking for a way to load a (long) list of entities, which may exist or not, by id.
Missing entities can just be omitted.

The goal is to be faster than looping over DataManager.id().one().

I already tried:
DataManager.ids().list() → throws an exception if any of the entities is missing
DataStore with LoadContext → the same; it throws if any entity is missing
(DataStore seems to be hard-coded to fail on missing entities.)

It would also help if I could quickly filter a list of ids against the database, so I could request only the existing entities.

Thank you very much

Martin

Hi Martin,

You can use a query with “in” condition, passing a list of ids as a parameter. For example:

List<UUID> idList = //...

List<Customer> customers = dataManager.load(Customer.class)
        .query("select e from Customer e where e.id in :ids")
        .parameter("ids", idList)
        .list();

Or even shorter:

List<UUID> idList = //...

List<Customer> customers = dataManager.load(Customer.class)
        .query("e.id in ?1", idList)
        .list();
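For your second question - quickly filtering a list of ids against the database - you can select only the id column with the same "in" condition and intersect the result in memory. A sketch (Customer and the helper names are illustrative; the value-loading call in the comment assumes DataManager's loadValues API):

```java
import java.util.*;

// Load just the id column in one round trip instead of whole entities, e.g.:
//
// Set<UUID> loaded = dataManager.loadValues(
//         "select e.id from Customer e where e.id in :ids")
//     .properties("id")
//     .parameter("ids", requestedIds)
//     .list()
//     .stream()
//     .map(kv -> (UUID) kv.getValue("id"))
//     .collect(java.util.stream.Collectors.toSet());
//
// The in-memory part is then a simple intersection:
class IdFilter {
    /** Returns the requested ids that appear in the set loaded from the database. */
    static List<UUID> existingOnly(List<UUID> requested, Set<UUID> loaded) {
        List<UUID> result = new ArrayList<>();
        for (UUID id : requested) {
            if (loaded.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }
}
```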

Hi Konstantin,

thank you very much. I will try this out.

The idList might contain 1E6+ entries, but I am already windowing it to 10,000 ids at a time, so this should not create a bottleneck.

Just for completeness:
Do you know whether the query solution on its own scales well / will be optimized by EclipseLink? Would Jmix / EclipseLink apply a clever optimization, or would it push the whole list through JDBC as-is?

Best regards,

Martin

PS: It is really fun to use Jmix.

Hi Martin,

There will definitely be no optimization by Jmix or EclipseLink in this case. So even 10K ids seems like too many - the result will be a huge SQL select with 10,000 parameter placeholders.

Why do you need to load so many entities in the first place? If you describe the whole task, maybe someone can suggest a better solution.

Hi Konstantin,

imagine a reporting application which periodically syncs with upstream systems, e.g. via CSV exports from an ERP, to create reports, answer user queries, etc.
It shall store the upstream data as Jmix entities so they can be queried / joined locally where a mere API call would not suffice. Also, API calls to ERP systems can be cumbersome… The ids can be mapped 1:1.

  1. When upstream has changes to existing entities, those shall be updated.
    As other entities might have references to the entity in question, it must be updated in place and cannot be deleted and recreated.

  2. When upstream brings a new id, the corresponding entity shall be created.

  3. Usually I would only transport partial CSV diffs / patches and update the changed entities - which is not much work. But sometimes the whole db must be re-synced with all 1E6+ entities.

In principle this works well. For now I have to query each entity separately via DataManager, which is OK but takes a few minutes. As EclipseLink has a batch feature that accelerates writes, I just thought the same could be done for reads.
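The per-window upsert step I have in mind looks roughly like this (CsvRow, applyRow and the Jmix calls in the comments are illustrative, not my actual code): load the window's existing entities with the "in" query, index them by id, then update in place or create.

```java
import java.util.*;
import java.util.function.Function;

class UpsertStep {
    /** Indexes a loaded batch by id so each CSV row finds its entity in O(1). */
    static <K, E> Map<K, E> indexById(List<E> loaded, Function<E, K> idOf) {
        Map<K, E> byId = new HashMap<>();
        for (E e : loaded) {
            byId.put(idOf.apply(e), e);
        }
        return byId;
    }
}

// Map<UUID, Customer> byId = UpsertStep.indexById(loadedWindow, Customer::getId);
// List<Customer> toSave = new ArrayList<>();
// for (CsvRow row : windowRows) {
//     Customer c = byId.get(row.id());
//     if (c == null) {                              // case 2: new upstream id
//         c = dataManager.create(Customer.class);
//         c.setId(row.id());
//     }
//     applyRow(c, row);                             // case 1: update in place
//     toSave.add(c);
// }
// dataManager.save(new SaveContext().saving(toSave));
```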

Best regards,

Martin