Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?


4D Tech mailing list
Short version:
I need to load some fields from records into a big text thingy.

The code runs on the server-side only.

I'm keen to preserve RAM.

What are the trade-offs in V17 between *GOTO SELECTED RECORD* and *SELECTION
TO ARRAY*? I've been using *SELECTION TO ARRAY*, but it's hard to read,
write, and maintain. And, I realized, might be de-optimized for memory
because you have to load all of the data you're processing into arrays.
(Yes, you can chunk it, but that doesn't change the fundamental point that
you pre-load a lot of data.)

Any test results or thoughts? I considered a fair range of options but ran
comparison tests on none. The long version below includes more details on
the two solutions I'm down to, plus the ideas that I discarded.


Long version
I'm working in V17 and I'm hoping that someone has done some real-world
tests already that could help me out with a question. Here's the setup: I
need to load up some fields from lots of records and push them into an
external system. It's going to Postgres, but that's not an important
detail, the result is a ginormous text object. The result could just as
well be a text or JSON file dump. The main constraint is available memory.
Performance matters when there are millions of records but, typically, the
only important consideration is memory. As far as the final solution goes,
it's ideally code that's easy to write, read, and maintain. As a plus, we
can position the code to run server side, so client-server optimization
isn't an issue. And, for the record, in lots of cases there isn't enough
data to make memory an issue at all, so readable reliable code is
definitely a preference.

Note: Yes, I can chunk data in ranges, etc. to keep things within my memory
footprint. I'm doing that....but the question still remains

Here are the solutions I've come up with:

*QUERY* and a *For* loop with *GOTO SELECTED RECORD*.
Easy to read, write and maintain. But when you use *GOTO SELECTED RECORD*,
do you get the whole record in V17? Without fat fields? Since this is
server-side or stand-alone, should I care? On the upside, you're only
loading one record at a time, so only burning through memory for that
record while you use it.
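As a minimal sketch of that pattern (using the same placeholder [Stuff] table and ID field as the examples further down):

```4d
QUERY([Stuff];[Stuff]Counter>=10000)
C_TEXT($output_text)
C_LONGINT($i)
For ($i;1;Records in selection([Stuff]))
  GOTO SELECTED RECORD([Stuff];$i)  // loads one record at a time
  $output_text:=$output_text+String([Stuff]ID)+Char(Carriage return)
End for
```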

*SELECTION TO ARRAY* and a *For* loop
This is what I have been doing....based on old habits as much as anything.
Yes, you only get the columns you want, but it gets _all_ of the rows at
once. So, you burn up a lot of memory with the arrays and then duplicate++
that memory when building up the output. On the code side, that kind
of *SELECTION
TO ARRAY*-loop-read by index code is ugly, tedious to write, and tedious to
maintain. It's clear(ish) and reliable, but only worth it if it pays for
itself somehow. In other words, it has to be a good deal better than *GOTO
SELECTED RECORD* to be worth it. Says the guy who has been doing all *SELECTION
TO ARRAY* forever.
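For comparison, a minimal sketch of the array version (same placeholder table, plus a hypothetical Name field):

```4d
ARRAY LONGINT($aID;0)
ARRAY TEXT($aName;0)
QUERY([Stuff];[Stuff]Counter>=10000)
  // All rows for the listed fields land in RAM at once
SELECTION TO ARRAY([Stuff]ID;$aID;[Stuff]Name;$aName)
C_TEXT($output_text)
C_LONGINT($i)
For ($i;1;Size of array($aID))
  $output_text:=$output_text+String($aID{$i})+Char(Tab)+$aName{$i}+Char(Carriage return)
End for
```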

Entity Selection and a *For* or *For each* loop
I have no clue why an entity selection is *C_OBJECT* instead of
*C_COLLECTION*, to give you a sense of how much I know about this stuff. I
was happy to discover that you can easily create an entity selection from a
current selection, so old style queries work fine:

*C_OBJECT*($stuff_es)
*QUERY*([Stuff];[Stuff]Counter>=10000)
$stuff_es:=*Create entity selection*([Stuff])

The resulting *For*/*For each* loop code is very readable, it's == *GOTO
SELECTED RECORD*, but with a different syntax. Otherwise, same same. I
*suspect* that the memory use here is excellent. I'm guessing that as you
navigate through the entity selection, you're only really pulling the data
you use. But maybe not. If you do a For each, you get an object (entity)
with all of the fields. So, possibly this approach is even worse than *GOTO
SELECTED RECORD* which, I'm guessing, doesn't load as many fields. I
haven't tested these points out in any way. If anyone has dug into this, it
would be great to know about the difference (if any) in what 4D loads when
you:

-- Use *GOTO SELECTED RECORD*

-- Use a *For each* loop on an entity selection, which builds an
$entity_object which you can then read/write to/from like $entity_object.ID

-- Use a *For* loop on an entity selection and then reference
$specific_es[0].ID

It's pretty easy to imagine different ways that 4D might have implemented
things that are more or less efficient in each of these cases. I have no
idea what they actually did. I'm kind of curious about this behavior in V17,
but have already talked myself out of using entity selections. Why? Because
the table and field references are brittle and *case-sensitive*. Man, I
truly hate case-sensitive names. When do I want them? Never. Not once, and
I never will. This isn't all on 4D, many languages are case-sensitive. It
makes sense if you're a computer. I'm not a computer, I'm a person...to me
it's just horrible. Anyway, not exclusively a 4D problem...because in 4D you
can avoid it altogether.

For those that haven't been following along at home, here's a hello world
level V17 For each loop over an entity selection:

*C_OBJECT*($stuff_object)
*For each* ($stuff_object;$stuff_es) // The loop automatically populates
$stuff_object as it iterates through the list.
$output_text:=$output_text+$stuff_object.ID+*Char*(*Carriage return*)
*End for each*
See that $stuff_object.ID statement? The ID part = [Stuff]ID. It's all
case-sensitive. Rename the field in the structure to id three months from
now and the code above breaks. And by "breaks" I mean: you don't get a
compiler error, you don't get a syntax error in the Method Editor, and you
likely don't get a runtime error. Your code just screws up silently. So,
yeah, not
going that way.

*Note*: Collections are very handy when the source data is a big static
JSON. It makes the static values highly interactive. I wrote a little
screen like that last week and loved the results.

*Note*: In a *For each* loop, I can't find a way to read the index of the
current item. Like, that you're on item 23. You can get the total item
count with .length, but I see no way to get the current index, on entity
selections or on collections. It can be useful when you've got a progress
indicator to update. You can always roll your own $index:=$index+1 sort of
thing.
Reminder: All of the new V17 stuff is 0 (offset) indexed, not 1 (position)
indexed.
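A roll-your-own counter would look something like this (untested sketch; $stuff_es as in the earlier example):

```4d
C_OBJECT($stuff_object)
C_LONGINT($index)
$index:=0
For each ($stuff_object;$stuff_es)
  $index:=$index+1  // 1-based position, maintained by hand
  If ($index%100=0)
    // update a progress indicator here
  End if
End for each
```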

Honorable mention: *Selection to JSON*
Yeah, kind of nice...a very excellent command in some situations. In this
case, wildly wrong, I'd say. You load the whole JSON in one go so you get
your source data + formatting + names. It's pretty flabby. Then you have to
parse and walk that to get the proper text. If 4D had a Selection to text
(->Table;Template) system that was *not* JSON, I'd be golden. That would be
perfect. The *Selection to JSON* code doesn't allow in-line functions, so
there's that. Oh, wait, 4D does have a command like this...*PROCESS 4D TAGS*.
Hmmm. Yeah, probably the best approach for memory and the worst for
brittleness. Not going there.

Okay, so does anyone have any relevant, V17-based test results yet? I don't
have the time or appetite to do the tests myself and won't be surprised if
no one else has either. Not to be a **** about it, but I'm only interested
in *test results*. It's fun to estimate program behavior from first
principles, but it has pretty much zero predictive value. Having just spewed
out a bunch of speculation, I certainly can't hold it against anyone else
for riffing too.

I've spent some embarrassing number of hours (for hours read "months") of
my life testing 4D performance and, well, you have to test to find out.
Conventional wisdom tends to be *worse* than random guessing. It's great to
hear theories and stories from the folks at 4D, but that's all they
are...stories and theories. Background information can give you a better
idea of what to test and where to look, but that's all. Modern machines +
modern OS + 4D + your code + all of the various subcomponents (RAM,
network, SSD)...it's a lot. So, it's not a criticism to say only testing
can hope to turn up meaningful results. Given all of those factors, narrow
tests are ideal and obviously can't be generalized too far. Still, lots
better than speculation!

Thanks.
**********************************************************************
4D Internet Users Group (4D iNUG)
Archive:  http://lists.4d.com/archives.html
Options: https://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
Hi

GOTO SELECTED RECORD (or any ORDA equivalent of navigating record by record) will be slower than S2A but will take less RAM.  That's based on actual experience.

Building up a single big text thingy will have RAM (and eventually potentially performance) issues if using straight string/text concatenation for each iteration.  Because each time you add something to a string, a copy is made in RAM which is bound to lead to memory fragmentation and a higher workload for garbage collection.   How significant is this?  No idea.  I've always avoided building up super-massive text as a single thing.

We use GOTO SELECTED RECORD for building up SQL INSERT statements for sending to a (Microsoft or MySQL) server.  We send the SQL text to the remote server every x iterations which makes it fairly slow going, but it keeps the RAM in reasonable control.

Hope that helps.

Best regards

Keith White
Synergist Express Ltd, UK.
4697775

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
Thanks Keith, that all sounds pretty right to me too. And, yes, I'm
chunking the pushes into batches to avoid a truly massive text object, when
necessary. I don't know about with SQL Server and don't remember about
MySQL, but with Postgres it's important to avoid single row inserts where
possible. Postgres is absolutely fanatical about integrity, so each
operation is automatically wrapped in a transaction. Pushing rows
one-by-one can, therefore, be surprisingly slow. So I bundle multiple rows
into a multi-value insert, and then several of those into a transaction. It
goes pretty quickly and for smaller pushes, it probably doesn't matter how
I build up the text, it just isn't that big. (Note for the PG people in the
crowd, I'm using an UPSERT, so COPY TO isn't an option.)
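For what it's worth, a hedged sketch of that batching shape, with made-up table and column names and no attention paid to proper SQL escaping (use parameter binding in real code):

```4d
C_TEXT($values;$sql)
C_LONGINT($i)
$values:=""
For ($i;1;Size of array($aID))
  If ($values#"")
    $values:=$values+","
  End if
  $values:=$values+"("+String($aID{$i})+",'"+$aName{$i}+"')"
  If (($i%500=0) | ($i=Size of array($aID)))  // flush every 500 rows
    $sql:="INSERT INTO stuff (id,name) VALUES "+$values
    $sql:=$sql+" ON CONFLICT (id) DO UPDATE SET name=excluded.name;"
    // send $sql to Postgres here; wrap several batches in one transaction
    $values:=""
  End if
End for
```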

Regarding the ever-resizing text block, I don't know how that behaves
today. I used to know...and knew the sort of crazy stuff you could do to
improve performance. Like, pre-size a giant block, keep track of the last
byte written and then write into bytes by hand. So, you didn't resize the
block as you went along. You might need a final trim or an expansion...but
you pretty much kept the block at a set size and then wrote into it by
direct character reference, etc. It's a lot more work and I have no idea
how modern versions of 4D on modern versions of Mac/Win handle all of this.
I have a notion in the back of my mind that all of that kind of byte
tracking stuff isn't too useful any more. I notice that neither 4D nor
Postgres has VARCHAR types internally; it's all variable-length
text. Haven't tested anything. Oh, we don't have any double-byte data to
worry about.

Thanks again!

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
With a little up front work, you can create a nice wrapper for SELECTION TO ARRAY/SELECTION RANGE TO ARRAY where you just pass an array of field pointers. You can use a process 2D array of each type to dynamically allocate/deallocate the necessary arrays and just manipulate the pointers. Some code fragments below.


  // Allocate a destination array of the right type for each source field.
  // (PM_Variable and ARRAY_Type are utility methods from my toolkit,
  // not built-in commands.)
For ($iParam;1;Size of array($aSource))
  $aDest{$iParam}:=PM_Variable (ARRAY_Type ($aType{$iParam}))
End for

  // Queue up each field/array pair; the trailing * defers execution.
For ($iParam;1;Size of array($aSource))
  SELECTION TO ARRAY($aSource{$iParam}->;$aDest{$iParam}->;*)
End for

  // Execute the queued loads for just the given range of the selection.
SELECTION RANGE TO ARRAY($start;$end)


John DeSoi, Ph.D.


> On Sep 15, 2018, at 2:07 AM, David Adams via 4D_Tech <[hidden email]> wrote:
>
> What are the trade-offs in V17 between *GOTO SELECTED* record and *SELECTION
> TO ARRAY*? I've been using *SELECTION TO ARRAY*, but it's hard to read,
> write, and maintain.


Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
Hey David,

You know, you say that RAM is the primary limiting factor but then also say
you're only interested in actual timing tests. Umm. Seems like the coding
version of the building contractor talking to a client:

You can have it fast
You can have it cheap
You can have it done well

Pick 2

Sub 'small RAM footprint' for cheap and 'reliable code' for 'done well' and
here we are. ;-)

I haven't done any timing tests of ORDA vs classic (as I call it) 4D. So
I'm rattling off in exactly the way you didn't want. That's what friends
are for.

I suspect ORDA is a bit slower on some things. I also suspect that will
change in coming updates. But a more relevant point I've noticed is ORDA
really seems to be a different internal engine. Consequently I don't think
the existence of ORDA has changed classic 4D ops. I expect something like
GOTO RECORD is doing the same thing in v17 it did before. There might be
some ad hoc optimization but I doubt it's much.

I also doubt there's much of a benefit reaching for one or two ORDA
commands in the midst of another 'classic' set of commands. The reason
being ORDA is using a different approach to manipulating the data. ORDA is
managing references to data while classic more often manages actual data.
GOTO RECORD being a great example. When I hear Thomas Maul talk about
various things being "super optimized" in ORDA I take that to refer to this
aspect of the way it works. To use ORDA commands you have to have an ORDA
data structure. Easy enough to create but these are not the same as the
classic ones (current record, current selection, sets, etc.). So there's
that 'penalty' to pay moving from one approach to the other. This leads me
to conclude I need to make a choice when deciding how a particular process
(not referring to a 4D process, just a segment of code) is going to be
built: classic or ORDA.

So this case sounds like there are really two considerations that impact
RAM - 1) selecting or getting the data and 2) building the text thingy. I
bet you would see less RAM consumed using strictly ORDA commands because
you're going to have millions of references instead of millions of
representations of the data.

Then comes assembling the text and for that I don't think it will matter
how you manage the data. A couple of thoughts:
 - open a disk file and write the text to it
 - write the text to a record

Both these are slower than RAM so maybe you look at the number of records
and use this when it's large and RAM when it's small.

And just to be on the record I know there's no technical reason you can't
mix ORDA and classic commands in the same little chunk of code. But if you
start off using classic 4D there's some point where you need to build an
ORDA representation of that data. At that point you've spent time doing the
transfer and now you essentially have two versions of the same data. For
most things this doesn't matter. For others, like millions of records, I
think it does. Plus the suite of ORDA commands does actually allow you
pretty much all the capabilities we're accustomed to. You just have to
think in that context. Which is a PITA when you've got years of classic 4D
know how.

On Sat, Sep 15, 2018 at 12:08 AM David Adams via 4D_Tech <
[hidden email]> wrote:

> [snip]



--
Kirk Brooks
San Francisco, CA
=======================

*We go vote - they go home*

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
On Sep 15, 2018, at 12:07 AM, David Adams via 4D_Tech <[hidden email]> wrote:
>
> Short version:
> I need to load some fields from records into a big text thingy.
>
> The code runs on the server-side only.
>
> I'm keen to preserve RAM.

[snip]

> TL;DR version
> I'm working in V17 and I'm hoping that someone has done some real-world
> tests already that could help me out with a question. Here's the setup: I
> need to load up some fields from lots of records and push them into an
> external system. It's going to Postgres, but that's not an important
> detail, the result is a ginormous text object. The result could just as
> well be a text or JSON file dump. The main constraint is available memory.


Are you generating large enough text objects to be constrained by memory? If so, is more memory an option? It’s relatively cheap, probably cheaper than the time taken to run tests.

Since you’re running server side, looping through records is going to be fast enough that you shouldn’t see any meaningful difference in timing between selection to array and GOTO SELECTED RECORD, especially if the server is using SSDs.  In any case, it should be faster to loop through reading records than for Postgres to do a search+insert/update; for this particular case, do you need to build text faster than Postgres can process it? If the Postgres server is not on a LAN, then WAN speed could end up being the primary bottleneck and how fast you can build text may be irrelevant.

I don’t think anyone could have meaningful test results, since you’re asking about memory usage and also mentioning timing. But, my opinion is if you’re building text blocks under 2 MB or so, there is no point in testing anything. If you’re building text blocks over 5 MB in small pieces at a time, the only test that really matters is appending to a text vs putting text into a pre-sized blob, especially under memory pressure. If your text object has to be copied to a different place in memory 10K times, plus forcing memory reorganization several times (assuming 4D even reorganizes its own memory) because you’re appending to a large text variable in a low-memory situation, then whether you used SELECTION TO ARRAY or GOTO SELECTED RECORD on the server probably doesn’t matter.

If you want to use arrays, the query is fast, and timing is not as critical, then SQL with LIMIT/OFFSET gets you arrays without the memory load of loading an entire selection of millions of records into arrays at once.
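In 4D that might look something like this sketch (assuming the Begin SQL/End SQL syntax with bound variables; LIMIT/OFFSET support and the exact INTO binding should be verified against the docs):

```4d
C_LONGINT($limit;$offset)
ARRAY LONGINT($aID;0)
ARRAY TEXT($aName;0)
$limit:=10000
$offset:=0
Begin SQL
  SELECT ID, Name
  FROM Stuff
  WHERE Counter >= 10000
  ORDER BY ID
  LIMIT :$limit OFFSET :$offset
  INTO :$aID, :$aName;
End SQL
  // bump $offset by $limit and repeat until fewer than $limit rows return
```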

If the server runs an SSD, memory is truly a problem, and the SSD has plenty of space, then building a text file is probably a reasonably fast solution.


> So I bundle multiple rows
> into a mutli-value insert, and then several of those into a transaction. It
> goes pretty quickly and for smaller pushes, it probably doesn't matter how
> I build up the text, it just isn't that big. (Note for the PG people in the
> crowd, I'm using an UPSERT, so COPY TO isn't an option.)

Using a staging table, you can effectively combine upsert with copy. Besides the potential speed increase on the Postgres side, not having to build SQL statements with 4D’s weak text-building tools is a huge plus.

https://www.flydata.com//blog/how-to-improve-performance-upsert-amazon-redshift/
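A hedged sketch of that staging pattern, with invented table and column names (assuming a one-time setup along the lines of CREATE TABLE stuff_staging (LIKE stuff)):

```4d
  // 1) COPY the batch into stuff_staging (fast bulk load, no upsert logic)
  // 2) then run one upsert + truncate on the Postgres side:
C_TEXT($sql)
$sql:="INSERT INTO stuff (id,name) "
$sql:=$sql+"SELECT id,name FROM stuff_staging "
$sql:=$sql+"ON CONFLICT (id) DO UPDATE SET name=excluded.name; "
$sql:=$sql+"TRUNCATE stuff_staging;"
  // send $sql over the same Postgres connection used for the COPY
```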

Jim Crate


Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
Following up on Jim’s and other responses I would venture to say that one good solution memory and performance wise would be:
- loop thru records using NEXT RECORD, or GOTO SELECTED RECORD
- preallocate a BLOB to build your text/json; say 5 MB is your limit, or maybe 50 MB, or 500 MB, you pick your number
- fill in the BLOB with your text/json, one record at a time, using INSERT IN BLOB
- when the BLOB is full, or you’re done with the selection, APPEND the blob to your output file, and reset the BLOB offset to 0
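An untested sketch of those steps (sizes and names are arbitrary; TEXT TO BLOB advances the offset variable for you, and the flush check ignores the edge case of one record overflowing the headroom):

```4d
C_BLOB($buffer)
C_LONGINT($offset;$i;$buffer_size)
$buffer_size:=50*1024*1024  // pick your number, e.g. 50 MB
SET BLOB SIZE($buffer;$buffer_size)  // pre-allocate once
$offset:=0
For ($i;1;Records in selection([Stuff]))
  GOTO SELECTED RECORD([Stuff];$i)
  TEXT TO BLOB(String([Stuff]ID)+Char(Carriage return);$buffer;UTF8 text without length;$offset)
  If ($offset>=($buffer_size-1024))  // nearly full: flush
    SET BLOB SIZE($buffer;$offset)  // trim to the bytes actually written
    // append $buffer to the output file here, then re-allocate:
    SET BLOB SIZE($buffer;$buffer_size)
    $offset:=0
  End if
End for
  // final flush: trim to $offset and append the remainder
```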

running server side on SSD, that should probably be a good alternative, I think ;-) (no testing done)

hth
julio

> On Sep 15, 2018, at 7:33 PM, Jim Crate via 4D_Tech <[hidden email]> wrote:
>
> On Sep 15, 2018, at 12:07 AM, David Adams via 4D_Tech <[hidden email]> wrote:
>>
>> Short version:
>> I need to load some fields from records into a big text thingy.
>>
>> The code runs on the server-side only.
>>
>> I'm keen to preserve RAM.
>
> [snip]
>
>> TL;DR version
>> I'm working in V17 and I'm hoping that someone has done some real-world
>> tests already that could help me out with a question. Here's the setup: I
>> need to load up some fields from lots of records and push them into an
>> external system. It's going to Postgres, but that's not an important
>> detail, the result is a ginormous text object. The result could just as
>> well be a text or JSON file dump. The main constraint is available memory.
>
>
> Are you generating large enough text objects to be constrained by memory? If so, is more memory an option? It’s relatively cheap, probably cheaper than the time taken to run tests.
>
> Since you’re running server side, looping through records is going to be fast enough that you shouldn’t see any meaningful difference in timing between selection to array and GOTO SELECTED RECORD, especially if the server is using SSDs.  In any case, it should be faster to loop through reading records than for Postgres to do a search+insert/update; for this particular case, do you need to build text faster than Postgres can process it? If the Postgres server is not on a LAN, then WAN speed could end up being the primary bottleneck and how fast you can build text may be irrelevant.
>
> I don’t think anyone could have meaningful test results, since you’re asking about memory usage and also mentioning timing. But, my opinion is if you’re building text blocks under 2 MB or so, there is no point to test anything. If you’re building text blocks over 5MB small pieces at a time, the only test that really matters is appending to a text vs putting text into a pre-sized blob, especially under memory pressure. If your text object has to be copied to a different place in memory 10K times plus forcing memory reorganization several times (assuming 4D even reorganizes its own memory) because you’re appending to a large text variable in a low-memory situation, then whether you used SELECTION TO ARRAY or GOTO SELECTED RECORD on the server probably doesn’t matter.
>
> If you want to use arrays, the query is fast, and timing is not as critical, using SQL with LIMIT/OFFSET gets you arrays without the memory load of having to load an entire selection of millions of records into arrays.
>
> If the server runs an SSD, memory is truly a problem, and the SSD has plenty of space, then building a text file is probably a reasonably fast solution.
>
>
>> So I bundle multiple rows
>> into a multi-value insert, and then several of those into a transaction. It
>> goes pretty quickly and for smaller pushes, it probably doesn't matter how
>> I build up the text, it just isn't that big. (Note for the PG people in the
>> crowd, I'm using an UPSERT, so COPY TO isn't an option.)
>
> Using a staging table, you can effectively combine upsert with copy. Besides the potential speed increase on the Postgres side, not having to build SQL statements with 4D’s weak text-building tools is a huge plus.
>
> https://www.flydata.com//blog/how-to-improve-performance-upsert-amazon-redshift/
>
> Jim Crate
>
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> Archive:  http://lists.4d.com/archives.html
> Options: https://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:[hidden email]
> **********************************************************************

--
Julio Carneiro
[hidden email]




Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
Hello people, thanks for all of the suggestions and feedback. Several of
you pointed out that my thinking was in the fuzzy-to-nonsense range. I hate
you all. So much. Not really, the comments were completely on the mark. I'm
trying to find a decent default option that will work over a wide range of
conditions. Some tables are tiny, some are very large. Some pushes have a
lot of data, some have very little. Most machines are up on AWS, a few are
not...but have GB fiber connections. So, there are variables...too many for
a best-fit solution for every case. I'm checking for the simplest, least
difficult default as that should work in the vast majority of cases. If
something does need help, then I can get in and optimize it, as required. I
have one case like that already and it was a total bear to finish.
Sometimes you get the bear, sometimes the bear gets you...

I was doing some chores today which gave me a chance to run tests and leave
my computer alone while things ran. What I wanted to test out was nothing
but the behavior of creating my big text object. What happens with it after
that isn't relevant. The text block built is identical on each run.

So, I did milliseconds before & after and memory stats before & after. Then
I looked at the results. I ran compiled, single-user, macOS 10.12.6 with 16GB
of RAM and a 3GB cache. This is a MacBook Pro from the past couple of
years...one of the ones with the shitty keyboard. Man, that thing really
makes me less productive.
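For anyone who wants to replicate the before/after approach, the harness is nothing fancy. Here's a Python analogue of the pattern (in 4D you'd use Milliseconds and the memory statistics commands instead; the workload passed in is a stand-in):

```python
# Minimal before/after harness: wall-clock time plus peak traced
# memory for one run of an arbitrary workload.
import time
import tracemalloc

def measure(work, *args):
    """Return (result, elapsed_ms, peak_bytes) for one run of work(*args)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = work(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed_ms, peak
```

As with my tests, run it at least twice and distrust the first pass: caches and lazy loading confound the first-run numbers.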

Back to the tests.  I ran compiled only. I launched, ran a test, got the
results, ran the test again, got the results and quit. Just because that's
how I normally test stuff, at a bare minimum. I didn't reboot. My test
sample pulls five smallish fields from about 30K records in a table with a
lot more fields. The methods for navigating the data are what differ. The
four versions that I checked are:

* *GOTO SELECTED RECORD*
* *SELECTION TO ARRAY* + *For* loop
* Entity selection + *For each* loop
* Entity selection + *For* loop

I'll start with the speed report as it's short: It doesn't matter. The
performance range for the four solutions on their two runs was so small
that I chalked it up to normal variability. So, *speed is not an issue*.
This leaves memory, ease of development, and stability/brittleness. For
memory, I'm not terribly confident in the numbers. The first run values are
confounded by everything else that might be happening with the first run
against that data, so I've ended up ignoring those results. The second run
results confirm what Keith said. Namely, that *SELECTION TO ARRAY* takes
more memory than *GOTO SELECTED RECORD*. Kind of has to. On the two entity
selection results, they seemed to use more memory than *GOTO SELECTED
RECORD* and less than *SELECTION TO ARRAY*. Or maybe not, I'm not testing
more. I don't want to start any superstitions here. I will say again though
that if you care and want to test it, forget first principles. Even if you
had and understood the 4D source code, you would have very poor chances of
guessing how things would behave by the time you're running your code. If
anyone does some testing, I'm sure there's an appetite for the results.

Okay, that's it. Not the kind of results that you can broadly apply. There
are just so many factors involved. For example, the memory profile on
*SELECTION
TO ARRAY* grows as you add more and more arrays. It has to, and you can
even calculate the costs pretty well. The difference for *GOTO SELECTED
RECORD* is likely very little. I only tested on a machine that runs the
engine, Client would likely show dramatically different behavior.
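As a rough illustration of that calculation (all numbers below are invented; modern 4D text is Unicode/UTF-16, so roughly two bytes per character):

```python
# Back-of-the-envelope memory cost of SELECTION TO ARRAY: every value
# you pull gets materialized in an array, so the floor is just
# records x fields x average value size, before per-array overhead.
def array_cost_bytes(records, fields, avg_chars, bytes_per_char=2):
    return records * fields * avg_chars * bytes_per_char

cost = array_cost_bytes(30_000, 5, 40)
print(cost / (1024 * 1024))  # roughly 11.4 MB before any overhead
```

GOTO SELECTED RECORD has no equivalent term that grows with the selection size, which is why its memory profile stays nearly flat as the field count climbs.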

Regarding the entity selections, that was just out of curiosity. I won't be
using them here. They don't offer a speed or memory advantage over *GOTO
SELECTED RECORD* and they make the code vulnerable to breakage. Someone
changes the capitalization of a field, and my code breaks. My server side
code with no UI...Not worth the risk. The code clarity on *GOTO SELECTED
RECORD* and then using field references is *excellent*. It's at least as
good as the readability of the entity-dot versions. But with regular old
field references, there's no risk of weird breakage and the code is
instantly understandable to any existing 4D programmer.

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
In reply to this post by 4D Tech mailing list
Not really an answer to your question but a more general observation:

Very often, when I have to do an export function towards other systems, I end up doing a bulk export.

Either because I am lazy or the other side is lazy or both sides are lazy. It's easy, just send everything. But sometimes it is worth doing more homework, on both sides, and exporting only what has changed since the last export. Sometimes this is simply not possible (too complex, the other side wants all data, etc.), but very often it would be. And it would - most of the time - result in a drastic reduction in the amount of data to exchange.

|||||| https://flury-software.ch/

-----Original Message-----
From: 4D_Tech <[hidden email]> On Behalf Of David Adams via 4D_Tech
Sent: Saturday, 15 September 2018 09:08
To: 4D iNug Technical <[hidden email]>
Cc: David Adams <[hidden email]>
Subject: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

Short version:
I need to load some fields from records into a big text thingy.

[...]


Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
On Sun, Sep 16, 2018 at 9:09 PM Olivier Flury via 4D_Tech <
[hidden email]> wrote:

> Not really an answer to your question but a more general observation:
>
> Very often, when I have to do an export function towards other systems, I
> end up in doing a bulk export.
>
> Either because I am lazy or the other side is lazy or both sides are lazy.
> It's easy, just send everything. But sometimes it is worth, doing more
> homework, on both sides, and export only what has changed since the last
> export. Sometimes this is simply not possible (too complex, the other side
> wants all data etc.), but very often it would be. And it would - most of
> the time - result in a drastic reduction of amount of data to exchange.
>

Well observed. In this case, I have the good fortune to be working with
source tables that include modification counters. (Think "sequence numbers"
on each change.) This makes it *incredibly* easy to find all records that
have been modified or added since the last push.

For deletion, records in 4D are marked with a deleted date. I've always
found this practice annoying and weird...and boy am I glad it's being used
in this case. With the delete date field in place (soft-deleted records), it's
simple to handle delete reconciliation. I don't push the delete date field,
just a Boolean based on it. So, you push everything up including the delete
flag, then a simple DELETE FROM table WHERE marked_for_deletion = true
takes care of it. Amazingly easy. Welsh Harris and Tim Nevels get all the
credit for setting things up this way in advance.

Even so, there are sometimes still millions of records to move,
particularly when we're first pushing a table over. In most cases, pushes
are quite manageable in size on a day-to-day basis. These tables are
getting pushed largely for aggregation and analysis, so these aren't one-off
operations. Push and keep pushing as more data flows through 4D.

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
In reply to this post by 4D Tech mailing list
David Adams:

> I'm checking for the simplest, least difficult default as that should work in the vast majority of cases.


Here are my two Eurocents:
it does not matter how you collect the data, but concatenating the text in memory vs. writing to a file makes a HUGE difference when you have lots of records.

I have a method in my toolbox that writes out a selection or a whole table into a file,
looping through the records and appending each record to the file.
Works great.

Then I did the same, but instead of writing to a file, I appended to a text variable.
That works well for small amounts of data but is unusable for more than about 20,000 records, as it slows down ever more.
(I am lazy and did not do any pre-sizing, chunking, whatever.)

So if you want to export the data anyway and look for the simplest way, then just loop through the data and write into a file.
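The slowdown Bernd describes is the classic cost of growing one big string: each append can recopy everything accumulated so far. A minimal sketch of the two shapes (function names invented; collecting pieces and joining once, or writing straight to a file, both avoid the repeated copies):

```python
# Why appending to one big text variable bogs down versus building
# the result in pieces. Both functions produce identical output.
def build_by_concat(pieces):
    big = ""
    for p in pieces:
        big += p            # may recopy the whole accumulated text
    return big

def build_by_join(pieces):
    return "".join(pieces)  # one final copy, linear overall
```

The concatenation version degrades roughly quadratically as the record count grows, which matches the "slows down ever more" behavior past ~20,000 records.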

Greetings from Germany,
Bernd Fröhlich

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
Bernd,

Thanks for the report, it's interesting and believable. In this case, I
have to get the data into memory to send it... I guess I could write to a
file, load that and then send it. Interesting concept, thanks!

On Mon, Sep 17, 2018 at 4:53 PM Bernd Fröhlich via 4D_Tech <
[hidden email]> wrote:

> David Adams:
>
> > I'm checking for the simplest, least difficult default as that should
> work in the vast majority of cases.
>
>
> Here are my two Eurocents:
> it does not matter how you collect the data, but concatenating the text in
> memory vs. writing to a file makes a HUGE difference when you have lots of
> records.
>
> I have a method in my toolbox  that writes out a selection or a whole
> table into a file.
> Looping through the records and appending each record to a file.
> Works great.
>
> Then I did the same but instead of writing to a file, I appended to a
> textvariable.
> Works well for small amounts of data but is unusable for more than about
> 20000 records as it slows down ever more.
> (I am lazy and did not do any pre-sizing, chunking, whatever.)
>
> So if you want to export the data anyway and look for the simplest way,
> then just loop through the data and write into a file.
>
> Greetings from Germany,
> Bernd Fröhlich

Re: Do I want to use SELECTION TO ARRAY instead of GOTO SELECTED RECORD server-side in V17?

4D Tech mailing list
In reply to this post by 4D Tech mailing list
David,
(not on v17)
I've played around with large text, mostly for testing purposes.
Large texts (>150 MB):
- very easily cause massive slowdowns when trying to append
- can easily crash/hang 4D.

While the *limit* on text may be 2 GB (+/-), my experience is that you
cannot get anywhere near that and expect to be able to manipulate the
text in any reasonable time.

So....
whatever end strategy you use, I would suggest you place a limit on the
size of the text and export (or whatever) at that point.

Chip
On Sat, 15 Sep 2018 10:40:24 +0000, Keith White via 4D_Tech wrote:
> Building up a single big text thingy will have RAM (and eventually
> potentially performance) issues if using straight string/text
> concatenation for each iteration.  Because each time you add
> something to a string, a copy is made in RAM which is bound to lead
> to memory fragmentation and a higher workload for garbage
> collection.   How significant is this?  No idea.  I've always avoided
> building up super-massive text as a single thing.
---------------
Gas is for washing parts
Alcohol is for drinkin'
Nitromethane is for racing