Tip: Binary insertion code

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Tip: Binary insertion code

David Adams-4
4D introduced binary searches on arrays back in V14 R4, so everyone with
V15 and up has it. Who knew?

http://doc.4d.com/4Dv15/4D/15.3/Find-in-sorted-array.301-3152640.en.html

What I can't find is a companion function for calculating the insert
position of a new element. Having such a routine helps make Find in sorted
array more useful. If you have to sort an array to search on it quickly,
the sort may cost more than you save on the find. Instead, control
insertion into the array so that it's always in sorted order without ever
having to call SORT ARRAY. That doesn't make sense in every situation but,
when it does, it's a great helper. As far as I can see, 4D doesn't have the
function we need so I figured I'd write one. I did better than that and
stole some old code from Dave Terry and updated it. (Posted below.)

For those that don't already know about binary searches, they take
advantage of the existing sort order to find elements more quickly (in
general.) Instead of starting at the first element and scanning through to
the end, you start in the middle and then go up/down depending on where
your value hits. You progressively narrow the range of searches until you
either hit a match or find where the element would go. Just like looking up
a word in a dictionary or a name in a phone book. It's a core algorithm
that many of you already know. For those that don't and for those that are
rusty (cough-cough...not me, no never...), here are a few links:

A very popluar and readable discussion that, unfortunately, uses a
recursive implementation (Dave's code shows how unnecessary recursion is in
this case. $50 fine for the first person to use "elegance" in combination
with "recursion" in a sentence.)

https://jeffreystedfast.blogspot.com.au/2007/02/binary-insertion-sort.html

There are several related topics on the Wikipedia, but this article looks
pretty on the mark:

https://en.wikipedia.org/wiki/Binary_search_algorithm

I started to adapt the version from

https://jeffreystedfast.blogspot.com.au/2007/02/binary-insertion-sort.html

...but hated the pointless recursion. I remembered that Dave Terry had
written this up years ago and dug up a copy of the note. Unfortunately,
these older notes are not listed in the KB, even when they still apply.
Hey! 4D guys! There's lots of great stuff in the old notes that you could
strip mine for new material. Find just about anything by Dave Terry or
Nicolas Barcet, as a starting point. Anyway, here's Dave's code, updated
and slightly tweaked:

If (False)
  // Array_BinaryFindPosition

  // Adapted from "Binary Searches:, ACI US Technical note #28, May 1991 by
Dave Terry.

  // Notes and changes:
  // * Updated to current version of 4D language.

  // * Various tweaks to variable names.

// * Poured on a bit of extra syntactic sugar to make sure that 4D sorts
through
// the pointers correctly.

  // * Change in behavior: A value past the end of the array returns Size
of array+1.
  //   Dave's code returned Size of array instead. (Deliberately.) For
insert operations,
  //   it is more useful to get back where the element should go. As a
consequence, you need
  //   to test if the index is past the end of the array or not.

// * Change of behavior: Dave expected string values, I'm using pointers to
array and value.
End if

C_LONGINT($0;$result)
C_POINTER($1;$array_pointer)
C_POINTER($2;$value_pointer)

$array_pointer:=$1
$value_pointer:=$2

C_LONGINT($high)
C_LONGINT($low)
C_LONGINT($current)
C_LONGINT($size_of_array)
$size_of_array:=Size of array($array_pointer->)

Case of
: (($value_pointer->)<=($array_pointer->{1}))
$result:=1  // Less than or equal to item 1, use item 1.

: (($value_pointer->)=($array_pointer->{$size_of_array}))
$result:=$size_of_array  // Equal to the last item, return it.

: (($value_pointer->)>($array_pointer->{$size_of_array}))
$result:=$size_of_array+1  // It's greater than the last element, specify
new position. ** This element does not exist yet. **

Else
$low:=1
$high:=Size of array($array_pointer->)

Repeat

$current:=$high+$low\2
If ($value_pointer->>$array_pointer->{$current})
$low:=$current
Else
$high:=$current
End if

Until ($low>=($high-1))

$result:=$high

End case

$0:=$result

I have only done limited testing on this but will be using the code next
week. If anyone finds problems or can suggest improvements, let me know.
The only test code I've run so far is for find, not insert and only with
longints. Here it is:

  // Quick-and-dirty checking code for binary position function.
// Not final test code, insert not proofed.

// Build a sorted array
ARRAY LONGINT($array_longint;0)
APPEND TO ARRAY($array_longint;1)
APPEND TO ARRAY($array_longint;2)
APPEND TO ARRAY($array_longint;3)
APPEND TO ARRAY($array_longint;4)
APPEND TO ARRAY($array_longint;5)
APPEND TO ARRAY($array_longint;6)
APPEND TO ARRAY($array_longint;6)
APPEND TO ARRAY($array_longint;8)
APPEND TO ARRAY($array_longint;9)
APPEND TO ARRAY($array_longint;10)

// Prepare a dump report
C_TEXT($dump)
$dump:=""
$dump:=$dump+"Value"+Char(Tab)
$dump:=$dump+"Position"+Char(Carriage return)

  // Loop through the array testing where each item should go.
  // Since we're checking the array contents themselves, each item
// should return it's actual position.
C_LONGINT($size_of_array)
C_LONGINT($index)
$size_of_array:=Size of array($array_longint)
C_LONGINT($index)
For ($index;1;$size_of_array)
C_LONGINT($result)
$result:=Array_BinaryFindPosition (->$array_longint;->$index)

$dump:=$dump+String($index)+Char(Tab)+String($result)+Char(Carriage return)
End for

// Try a value that is lower than any in the array. It should return 1.
C_LONGINT($test_value)
$test_value:=-1
$result:=Array_BinaryFindPosition (->$array_longint;->$test_value)
$dump:=$dump+String($test_value)+Char(Tab)+String($result)+Char(Carriage
return)

// Try a value that is higher than any in the array. It should return Size
of array +1 (10).
$test_value:=13
$result:=Array_BinaryFindPosition (->$array_longint;->$test_value)
$dump:=$dump+String($test_value)+Char(Tab)+String($result)+Char(Carriage
return)

// Grab the text and have a look at it.
SET TEXT TO PASTEBOARD($dump)

Here's what my output looks like:

Value Position
1 1
2 2
3 3
4 4
5 5
6 6
7 8
8 8
9 9
10 10
-1 1
13 11
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Fwd: Tip: Binary insertion code

David Adams-4
Douglas sorted me out on this one...but I see he only emailed me. I don't
want to confuse anyone checking the archive, so I'm passing the thread back
up. First Douglas' message to me and then my response:

The docs read per below


Position of its first occurrence if the value is found; otherwise position
where the value should be inserted


That should do it?


---------- Forwarded message ----------
From: David Adams <[hidden email]>
Date: Sat, Jan 14, 2017 at 5:51 PM
Subject: Re: Tip: Binary insertion code
To: Douglas von Roeder <[hidden email]>


I see. Clearly, I'm always the last to know. But like they say, "Why
reinvent the wheel when you can reinvent the whole car..." ;-)

Doug, thanks for letting me know. As it stands the sample code is
potentially still of some use to any of our friends using V14 R3 or
earlier. For anyone in that camp, I ran my limited test cases through the
4D command it provides the same answers as my code. The 4D code is better
because in a lot of ways, particularly that it tells you if the position
indicates where the target value *is* or where it *should be.* Super handy.
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

Peter Jakobsson-2
In reply to this post by David Adams-4

On 14 Jan 2017, at 01:57, David Adams <[hidden email]> wrote:

> 4D introduced binary searches on arrays back in V14 R4…..What I can't find is a companion function for calculating the insert
> position of a new element


Interesting point.

Maybe the thinking was that the ratio of lookup actions to populating actions is normally so huge that there’s no advantage in optimising the former ?

It seems unlikely that one would want to alternately add an element - do a look up - add an element - do a lookup etc..in iterations running into the 10’s of thousands which is what it takes for “Find in Array” to start becoming noticeably sub-optimal compared to binary searches. (Although, once it does, the binary search makes a massive difference, especially in recursive logic where the number of array lookups increases exponentially with the size of the array).

Peter
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
On Sun, Jan 15, 2017 at 05:05 Peter Jakobsson <[hidden email]> wrote:

Hey Peter, hopefully you've seen my other post. I obviously hadn't read the
docs because Find in sorted array does everything that I could hope for,
and does it quite nicely. It's set up for find and insert operations and
works great. A very nice addition to the language.

>
> (Although, once it does, the binary search makes a massive difference,
> especially in recursive logic where the number of array lookups increases
> exponentially with the size of the array).
>

There's no need to use recursion. Dave Terry's code avoids it with
iteration but you can always replace recursion with a stack. Recursion
automatically gives you a stack for "free"...but of the cost of stacking a
full method instance+variable frame instead of just stacking the values in
a stack that you manage.

Also, if you care, check the Wikipedia (or wherever you like), and you'll
see that neither approach involves an exponential increase. If it matters
to youl
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

Arnaud de Montard
In reply to this post by David Adams-4

> Le 14 janv. 2017 à 07:53, David Adams <[hidden email]> a écrit :
>
> [...] "Why reinvent the wheel when you can reinvent the whole car..." ;-)

You have to know that, at the very beginning, that car had no wheel  :-)
(translated from french)
  "Binary search in array is great. But something's missing IMHO.
   Where do I insert the value in the array if it is not found?
   Seems obvious this information goes along with a binary search.
   We would not have to sort every time we add a value,
   especially if several arrays are synchronized."
And in v14r4, the car had wheels.

--
Arnaud de Montard


**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

Keisuke Miyako
that is absolutely right.

the syntax and behaviour of "Find in sorted array" was modified during the beta phase of 14R4,
"Following a feedback received from the customer who was the source of the feature request"

Implementation changes in feature "Find in array by dichotomy" (2014)
http://forums.4d.fr/Post/FR/15548126/1/15548127#15548127

> 2017/01/15 8:13、Arnaud de Montard <[hidden email]> のメール:
>
> You have to know that, at the very beginning, that car had no wheel  :-)
> (translated from french)
>  "Binary search in array is great. But something's missing IMHO.
>   Where do I insert the value in the array if it is not found?
>   Seems obvious this information goes along with a binary search.
>   We would not have to sort every time we add a value,
>   especially if several arrays are synchronized."



宮古 啓介
セールス・エンジニア

株式会社フォーディー・ジャパン
〒150-0043
東京都渋谷区道玄坂1-10-2 渋谷THビル6F
Tel: 03-6427-8441
Fax: 03-6427-8449

[hidden email]
www.4D.com/JP

**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
For anyone using a version of 4D prior to V14 R4, you may actually find the
Dave Terry code that I tweaked and posted of use. It's 25 years old and
still works. The issue is, how do you distinguish between a result that
says "this is where the element you're looking for *is*" and one that says
"this is where the element *belongs*, if you add it." For cases where you
keep an array of unique values and don't care about duplicates:

* Find the element's position.

* If the position is greater than the size of your array, you need to add
an element.

* If the position is less than or equal to the size of your array, does the
target value match the existing value at that position?

* If they match, you're done because the value is already in place.

* If they don't match, you need to insert the new element in that position
and set it with your target value.

Unique value arrays and binary insert/lookup are a pretty natural
combination, so this isn't a rare situation to find yourself in. Is it work
the effort? Depends. For a tiny array, a sequential (find in array) scan is
just fine. You can test it out in your environment to see if it's worth the
trouble. Then again, with the 4D command (or the more limited version
adapted from Dave Terry) in place, it's very little work to maintain an
array in sorted order & use binary search and insert.



On Sun, Jan 15, 2017 at 11:07 AM, Keisuke Miyako <[hidden email]>
wrote:

> that is absolutely right.
>
> the syntax and behaviour of "Find in sorted array" was modified during the
> beta phase of 14R4,
> "Following a feedback received from the customer who was the source of the
> feature request"
>
> Implementation changes in feature "Find in array by dichotomy" (2014)
> http://forums.4d.fr/Post/FR/15548126/1/15548127#15548127
>
> > 2017/01/15 8:13、Arnaud de Montard <[hidden email]> のメール:
> >
> > You have to know that, at the very beginning, that car had no wheel  :-)
> > (translated from french)
> >  "Binary search in array is great. But something's missing IMHO.
> >   Where do I insert the value in the array if it is not found?
> >   Seems obvious this information goes along with a binary search.
> >   We would not have to sort every time we add a value,
> >   especially if several arrays are synchronized."
>
>
>
> 宮古 啓介
> セールス・エンジニア
>
> 株式会社フォーディー・ジャパン
> 〒150-0043
> 東京都渋谷区道玄坂1-10-2 渋谷THビル6F
> Tel: 03-6427-8441
> Fax: 03-6427-8449
>
> [hidden email]
> www.4D.com/JP
>
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> FAQ:  http://lists.4d.com/faqnug.html
> Archive:  http://lists.4d.com/archives.html
> Options: http://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:[hidden email]
> **********************************************************************
>
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
Hey, and how about that...another use for binary search. I'm stuck in a
Wikipedia rabbit hole about distributed hashing, and just ran into an in
interesting use of a binary search. Imagine that you're trying to
distribute storage or work amongst a bunch of things. For simplicity,
imagine that you want to store records with names starting with "A" on one
machine, "B" on another, and so on. If you have a list of addresses and a
list of the starting strings, you can do a binary looking and get the right
machine address.

A   Machine 1
B   Machine 2
...
and so on. So imagine 26 machines as a bucket for records, divided up by
initial letter in the English alphabet. So far, so good. But wait, letter
frequency isn't even, so some machines are going to get way more records
than another.  Here's one initial letter frequency sequence:

The first letter of an English word, from most to least common:

t o a w b c d s f m r h i y e g l n p u v j k q z x

https://en.wikipedia.org/wiki/Letter_frequency

Maybe you want to split "T" onto three machines:

T
Ti
To

Just guessing there on the splits. (The the dvisions would depend very much
on the particular data set.) Now how does a binary search help you? Because
you don't *care* about an exact match, you're just looking for the nearest
match. So if you word is Tuesday, you should hit the "To" machine and if
your word is "Table" you should hit the "T" machine, and so on.

Okay, that was a pretty meandering and somewhat distracting way to say
"sometimes the goal of a binary search is to find the nearest match." I
have a particular love of algorithms that find "nearly" rather than
"exactly"....

For anyone with an interest in algorithms and a bit of time to waste, I
recommend heading down the hashing rabbit hole. If you already understand
hash tables, the interest bits are hash trees and distributed hashes. These
two are both quite interesting:

https://en.wikipedia.org/wiki/Consistent_hashing
https://en.wikipedia.org/wiki/Rendezvous_hashing
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

Peter Jakobsson-2
In reply to this post by David Adams-4

On 14 Jan 2017, at 23:32, David Adams <[hidden email]> wrote:

> There's no need to use recursion. Dave Terry's code avoids it with
> iteration but you can always replace recursion with a stack. Recursion
> automatically gives you a stack for "free"

Hi David

I think you’re possibly referring here to the implementation of the binary search itself. I was more alluding to recursive operations on data sets at the logical level in general.

For example say I have an array which holds hierarchical data (arrCHILD, arrPARENT). When you want to scan the array with a logical function that tests for each element being in a particular parent group then that scan will involve a number of comparisons which will increase with the square of the data set size times the number of levels in the hierarchy if we use a linear lookup such as Find in Array.

Using a binary search brings the scan time back to a near-linear growth with the size of the data set.

A while back I implemented a “home rolled” binary search for this reason (see below) which dropped a 20 second (recursive) operation down to under a second. But my point was that you only really get any practical benefit from binary searches when there’s some kind of recursive data lookup like that involved because Find in Array is so fast anyway.

Best Regards

Peter

(P.S. I long time ago I learned the lesson to not waste too much time coding low level stuff in 4D because 4D eventually comes along and blows all your hard work away by building it into the language. About 6 months after I did this - feeling quite pleased with myself LoL, sure enough it appeared in some R-version of v15. Lesson re-learned ! All the same the linear search was simply unworkable at the time).

******************* Binary Search on Longint Array *********************

  //     $1: Pointer to data array to lookup (Must be pre-sorted in ascending order)
  //     $2: Pointer to index array for cross-reference
  //     $3: Value to find (i.e. the value who's array index is to be returned)
  //     $4: Pointer in which to return array index in optimiser arrays

C_POINTER($1;$2;$4;$P_ptrREFERENCE;$P_ptrINDEX;$P_ptrOptiINDEX)
C_LONGINT($0;$3;$P_ID_FIND;$P_idxMIN;$P_idxMAX;$P_idxFOUND;$P_idxMIDPOINT;$P_SIZE_ARRAY;$P_SIZE_RANGE)
C_POINTER($C_NO_POINTER)

$C_NO_POINTER:=apf_run_c_getNoPointer

$P_ptrREFERENCE:=$1
$P_ptrINDEX:=$2
$P_ID_FIND:=$3
$P_ptrOptiINDEX:=$4

$P_SIZE_ARRAY:=Size of array($P_ptrREFERENCE->)
$P_SIZE_RANGE:=$P_SIZE_ARRAY

$P_idxMIN:=1  //     Take an arbitrary halfway point
$P_idxMAX:=$P_SIZE_ARRAY
$P_idxFOUND:=(-1)

While (($P_idxMAX>=$P_idxMIN) & ($P_idxFOUND<0))

$P_idxMIDPOINT:=((($P_idxMAX-$P_idxMIN)\2)+$P_idxMIN)

Case of
: ($P_ID_FIND>$P_ptrREFERENCE->{$P_idxMIDPOINT})

$P_idxMIN:=($P_idxMIDPOINT+1)

: ($P_ID_FIND<$P_ptrREFERENCE->{$P_idxMIDPOINT})

$P_idxMAX:=($P_idxMIDPOINT-1)

: ($P_ID_FIND=$P_ptrREFERENCE->{$P_idxMIDPOINT})

  //     We've found it

$P_idxFOUND:=$P_idxMIDPOINT
$P_idxMAX:=$P_idxMIDPOINT
$P_idxMIN:=$P_idxMIDPOINT

End case

End while

If ($P_idxFOUND>0)
$0:=$P_ptrINDEX->{$P_idxFOUND}
Else
$0:=(-1)
End if

  //     Return the optimiser index if required

$P_ptrOptiINDEX->:=$P_idxFOUND

**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

KirkBrooks
In reply to this post by David Adams-4
David,
As a rule of thumb how large an array are you dealing with where managing
them like this starts to matter?

I'm thinking thousands+ of elements but maybe it's a noticeable bump on
smaller ones?

On Fri, Jan 13, 2017 at 4:57 PM, David Adams <[hidden email]> wrote:

> 4D introduced binary searches on arrays back in V14 R4, so everyone with
> V15 and up has it. Who knew?
>
> http://doc.4d.com/4Dv15/4D/15.3/Find-in-sorted-array.301-3152640.en.html
>
>
--
Kirk Brooks
San Francisco, CA
=======================
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

Arnaud de Montard

> Le 15 janv. 2017 à 18:31, Kirk Brooks <[hidden email]> a écrit :
>
> [...]
>
> I'm thinking thousands+ of elements but maybe it's a noticeable bump on
> smaller ones?

I put some test code under. It seems the position of the value to search (lines with *) seems to strongly influence the result. In the middle, sorted array is faster about 1000 items.

ARRAY TEXT($item_at;0)
$factor_l:=100  //vary here
$lorem_t:=Str_loremIpsum (500)*$factor_l  //returns a suite of words…
Str_explode (->$item_at;$lorem_t;" ")  //words to array
SORT ARRAY($item_at;>)
$soa_l:=Size of array($item_at)
  //$whereToFind_l:=$soa_l\$factor_l  //* value to search near to an end
$whereToFind_l:=$soa_l\2  //* value to search in the middle
$val_t:=$item_at{$whereToFind_l}
$ticks_l:=Tickcount+(60*5)
ARRAY LONGINT($ms_al;2)
$ms_al{1}:=0
$ms_al{2}:=0
Repeat
        $ms_l:=Milliseconds
        $z:=Find in array($item_at;$val_t)
        $ms_al{1}:=$ms_al{1}+(Milliseconds-$ms_l)
        $ms_l:=Milliseconds
        $z:=Find in array sorted($item_at;$val_t;>;$star_t;$end_t)
        $ms_al{2}:=$ms_al{2}+(Milliseconds-$ms_l)
Until (Tickcount>$ticks_l)
TRACE  //read result, influence of $factor_l,

--
Arnaud de Montard




**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
In reply to this post by Peter Jakobsson-2
> I think you’re possibly referring here to the implementation of the binary
> search itself.

Yes I was.

> For example say I have an array which holds hierarchical data (arrCHILD,
> arrPARENT). When you want to scan the array with a logical function that
tests
> for each element being in a particular parent group then that scan will
> involve a number of comparisons which will increase with the square of the
> data set size times the number of levels in the hierarchy if we use a
linear
> lookup such as Find in Array.

I don't understand what this example is at all, but I'll take your word for
it.

> (P.S. I long time ago I learned the lesson to not waste too much time
coding
> low level stuff in 4D because 4D eventually comes along and blows all your
> hard work away by building it into the language.

Sure, eventually. I do wish I'd built my own hash tables 15 years ago as we
didn't get them (after a fashion) until V14 and I most *definitely* needed
them in a V13 app.
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
In reply to this post by KirkBrooks
> As a rule of thumb how large an array are you dealing with where managing
> them like this starts to matter?
>
> I'm thinking thousands+ of elements but maybe it's a noticeable bump on
> smaller ones?

Hey Kirk, I haven't the faintest idea. (But check the bottom of this
message for more details.) It's going to depend on what you're doing, but
that may not be the point. To set the scene, I'm looking at 4D's C_OBJECT
and related commands these days. I'm trying to sort through the best
options available for various situations. One big limitation in these
commands is that you can address items by position (relative or absolute),
and you can't address array elements by index unless you pull the whole
array out with OB PARSE ARRAY. I don't actually know if this is a
real-world problem, but it bugs me. (I haven't run speed tests.) I'm just
experimenting.

<fair_warning>I go into full-on nerd mode below. There are some good points
and a nice twist on the available object-access strategies in 4D. Buy,
well, I'm going full nerd.</fair_mailing>

So what does this have to do with binary search? Well, I'm dealing with
items with some unique property - or perhaps two unique properties. That
means that some kind of indexed can be built and that index can be ordered.
I asked about techniques on the list and three came up. And remember, I'm
dealing with unique values - my example uses values that aren't likely to
be unique...but area easy to ready

1) Build a parallel array with the index values. So, you store

  [Anderson]
  [Berkowitz]
  [Fredrickson]

...and those three elements match up with larger objects stored in an
object array. If I want to find "Anderson", I look at the tiny text array
and then know if a) Anderson is defined at all and b) where in the big
object array to look for it.

2) Another idea: Skip the hassle and space of storing a secondary array and
just load up the object array and iterate over it. The downsides here are
obvious:

-- If an item doesn't exist, you have to pay the maximum price: Load the
full object array and scan all of its items. With the index array, all you
need to do is scan the text array. Even the full price (scan all items) is
smaller, and you still get a definitive answer. At this point, you can also
do a binary search and (generally) optimize the lookup.

3) The third idea from Rob and Justin is to use the unique value as a key
name. This is a pretty awesome idea for 4D, as it turns out. Unfortunately,
it's not an obvious one, given the names of the commands and it makes some
specific cases of processing the JSON a bit more difficult. (I'd guess that
most people won't run into this situation, and it isn't a flaw of any
kind.) Imagine that you're storing product definitions with a product code:

{
   "products":[
      {
         "id":"ABC",
         "price":1.23
      },
      {
         "id":"DEF",
         "price":4.56
      }
   ]
}

This is just made-up JSON I've typed in by hand, but hopefully you get the
idea. Imagine that there are a stack of properties, not just a price. So, a
big object. Here's the version where the ID is used as the key:

{
   "ABC":{"price":1.23},
   "DEF":{"price":4.56
   }
}

Again, made-up JSON and imagine that there are a bunch of properties. The
deal here is that you can get 4D to grab items using the ID:

OB GET ($object;"ABC")

You don't have to iterate over everything (not that that's straightforward
to start with in 4D), and you don't have to put everything into an
extractable array...and then pull the whole array out.

So, #3 is kind of kick-ass solution in the 4D-specific universe. (With a
complete command set, you would not necessarily end up doing things this
way.) Oh, the reason this is so good in 4D is because of how object access
works *inside* of 4D. How does it work? All we know is that it's a hash
table - nothing more. There are a lot of ways to deal with the specifics of
implementing a hash table, and they have a great deal in common with the
tools and trade-offs used in building a variety of index operations. LR has
worked with these sorts of questions for a very, very long time and is
obviously very good at it by now. So, take it as read that the underlying
hash table implementation we use indirection with the C_OBJECT and object
field commands is first rate. That's why the
random-access-by-using-the-unique-value-as-an-actual-key kicks ass. It's
because the keys are supported by an underlying data structure that's
optimized for random access to an unordered value.

But there are a couple of limits on the approach above. One of which is
that you only get the one index. If you have a couple of properties, it
doesn't help. Also, what if you have a good reason to keep all of the data
in a single large object? That's good for serialization, archiving, and
transmission (as a parameter, record, message of any sort.)

So what about combining #1 (support array of unique values) and #3 (main
unique value as key for a key:value pair where unique_value:{definition}.)
Sure you can.  Take this JSON as an example (again, not generated with code
so apologies for glitches):

{
   "person_id_index":"[1,3,5,8,28]",

 "person_last_name_index":"[\"Leavens\",\"Smith\",\"Carr\",\"Harris\",\"Stewart\"]",
   "person_definitions":[
      {
         "id":1,
         "last_name":"Leavens",
         "first_name":"Justin",
         "country":"USA"
      },
      {
         "id":3,
         "last_name":"Smith",
         "first_name":"Cannon",
         "country":"Canada"
      },
      {
         "id":5,
         "last_name":"Carr",
         "first_name":"Justin",
         "country":"Australia"
      },
      {
         "id":8,
         "last_name":"Harris",
         "first_name":"Welsh",
         "country":"USA"
      },
      {
         "id":28,
         "last_name":"Stewart",
         "first_name":"Wayne",
         "country":"Australia"
      }
   ]
}

You've got five people with a few attributes. Pretend that ID is unique and
that last name is also unique. The objects are ordered by the ID. What then
are person_last_name_index and person_id_index? They're indexes into the
objects. Let's say you wan to find out if the person with ID 28 is in the
array of people definition objects. Naive approach:

ARRAY OBJECT($people_ao;0)
OB GET ARRAY($object;"person_definitions";$people_ao)
For ($index;1;Size of array($people_ao))
  If (OB Get($people_ao{$index};"id")=28)
     $match:=$index
     $index:=Size of array($people_ao)+1 // Break
  end if
End for

That's off the top of my head (I'm in BBEdit now), so apologies for
glitches.

There's nothing wrong with this approach, but what if you have 10's or
100's of thousands? Justin Leavens had such a situation in the real world,
which lead him to the idea of using the main key as an actual key. Okay,
back to the example, What help do we get from person_id_index in this case:

ARRAY LONGINT($ids_al;0)
JSON PARSE ARRAY(OB Get($object;"person_id_index"))

....assuming that I got the syntax right, this takes a simple JSON text
block:

   "person_id_index":"[1,3,5,8,28]",

and turns it into a longint 4D array:

[1]
[3]
[5]
[8]
[28]

It's ordered, so we can use a binary search to find out if the ID we want
is present or not, and, if present, where exactly. How does this help?
Imagine you've got 100,000 objects and the ID you want to match is not
found. If you have to load and scan the entire array of objects
*sequentially*, that's a lot of work. A non-matching element is always
worst-case time because you have to scan the entire array from start to
finish to prove that your item isn't there. No binary search is possible
because there isn't any order. (They're just a bunch of objects.) The
use-the-key-as-the-key trick works with 4D because it's got a
well-implemented hash table underneath it all. Okay now imagine that you
instead of searching the 100,000 objects instead you can scan 1000,000
longints. What does that buy you. A lot. In a worst-case situation before,
you scanned the entire object array. In the new case:

* You do a binary search on the longint array and don't ever load the
object array, let alone scan the 100K items fruitlessly. Huge savings! How
huge? I've got a table below, but scanning 1,000,000 items takes a max of
30 comparisons for a binary search instead of 1,000,000 for a sequential
scan. Big savings.

* If you do need to load up the whole array and retrieve the object, you
don't need to find its position - you've already got it. That's another big
win. No need to scan the contents of the items until you find what you
need. The position in the index array = the position in the larger object
array that you're 'indexing'. So, again, a big win at a small cost.

* You've had to pay a price, loading the person_id_index and scanning it.
But, again, that's not terribly expensive relative to the benefit, in this
case. In the real world, you might want a parallel array that doesn't need
constant rehydration before use. If you eventually need to inject the array
into your object for storage, etc., you could do that. (As a related
example, I've got an ErrorStack object where I keen an ARRAY OBJECT going
to make counting, pushing, and popping easy. I only convert it to a single
big object when I need it serialized.)

Now, in the example so far, you will still be better with setting the ID as
the key name because of 4D's underlying hash table system. (You need to use
a legal JSON name, so it might be "ID_1":{"price":1.23}) But what about
searching by a second key? My last name example is terrible, but imagine a
product with a public SKU and a different internal ID. Not that rare a
situation in a lot of tables. Anyway, this is where my example uses

"person_last_name_index":"[\"Leavens\",\"Smith\",\"Carr\",\"Harris\",\"Stewart\"]",

These values are ordered in the same way as the base objects, so they're
not ordered in a useful way by themselves. But, their order does allow you
to relatively inexpensively recover the position of the larger object. In
this case, I'd be using Find in array...but you could even make a small
object structure using technique #3 described earlier.

So that's what I'm thinking about when it comes to binary searches. I'm
using it for something like the person_id_index because, well, why not?

Speaking of why not, your original point was about the break-even point for
using a binary search instead of find in array. I could say "it depends",
but that's a poor answer, in this case. First take the extreme cases.
You've got an array of 1,000,000 unique items:

* Your target value is in element 1.
* Your target value is in element 1,000,000.
* Your target value is not in the array.

With a find in array, you scan the array from 1 to 1,000,000, so :

* Your target value is in element 1: 1 comparison. You win!
* Your target value is in element 1,000,000: 1,000,000 comparison. You
lose, biggly.
* Your target value is not in the array: 1,000,000 comparisons. You lose,
biggly.

With a binary search, the number of required comparisons is *massively*
reduced. I had to look up the formula for calculating the maximum # based
on the size of the sorted array and it is log2(n), rounded up to next whole
integer. Here's how this breaks down, according to the table I found in
"Explorations in Computing: An Introduction to Computer Science and Python"
(I just did a Google search and it came up in Google Books.)

Size                 Max comparisons
                            2    1
                            4    2
                            8    3
                          16    4
                     1,000   10
                     2,000   11
                     4,000   12
              1,000,000   20
       1,000,000,000   30
1,000,000,000,000   40

What you have to say "scales well." If you've got a situation where you can
benefit from binary search, why wouldn't you use it? It's just
substantially better and doesn't take meaningfully more work. For my cases,
I'd use anywhere I've got an array of unique values where I'm controlling
insertion. At that point, I'm potentially getting a gain with really no
cost. Where will it show up as a benefit in the real world? Now that
totally depends. Sometimes it won't, it depends on your
data/platform/situation. But if you're just talking algorithms, binary
search is superior.

Don't get me wrong! I'm not about micro-optimizations and I'm not about
going to extra work to optimize "just in case." Micro-optimization and
optimize-in-advance have gotten me into *way* more trouble down the years
than not optimizing something and having to go back. But when you get to
pick a design and one approach is broadly superior without costing extra
work, why not? Just seems sensible.

Warning: Notice from the earlier example that you can construct edge cases
where a sequential scan of an array takes fewer comparisons than a binary
search. So what. It's a stupid example. Seriously. I've seen way, way, way
too many 4D demos down the years that shows "massive" performance gains
that only worked in minority cases. By definition these are the cases of
least importance! If anyone wants to optimize something, make damn sure
they can prove it works more of the time, not only in weird (sometimes
entirely artificial) cases. This is why, for example, I've long been unable
to reproduce virtually any of the "optimizations" I've heard about 4D. They
only work in edge cases. LR and the team are doing a great job of
optimizing server operations and I don't see any reason to second guess
them.
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
P.S. Another nice feature of a binary search as it adds no space relative
to a scan. You exploit the sort order, but don't need to hang any extra
attributes off the array to make it work. That's totally cool. That's one
reason that binary search is a basic, long-standing, and still-popular
technique. Massive speed optimization with zero extra space? Rare.
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

KirkBrooks
In reply to this post by David Adams-4
Hey David,
You know, you really are contributing massively to our collective knowledge
with these posts.

On Sun, Jan 15, 2017 at 2:57 PM, David Adams <[hidden email]> wrote:

> > I'm thinking thousands+ of elements but maybe it's a noticeable bump on
> > smaller ones?
>
> Hey Kirk, I haven't the faintest idea. (But check the bottom of this
> message for more details.) It's going to depend on what you're doing, but
> that may not be the point.

​Agreed. I did some testing by constructing a longint array populated with
random numbers (1 -> 100000). I used Find in array first. Then sorted the
array and repeated with Find in sorted array. ​Measuring in MS I didn't see
a difference until the array size got to 100k elements and I repeated Find
in array to find all instances of the target which I set as the element at
the bottom of the unsorted array. And even then the time difference was
only 1ms, uncompiled, v15.3. Find in array may have been quietly improved
while we weren't looking.

Here's the results from one run that found 4 instances in 100k elements:

Populate array: 5959ms
Find in array: (4)  1ms
Sort: 49ms
Find sorted: (60263, 60266)  0ms


One big limitation in these
> ​ ​
> commands is that you can
> ​['t]​
> address items by position (relative or absolute),
> and you can't address array elements by index unless you pull the whole
> ​
> array out with OB PARSE ARRAY.
>
​It's a real limitation compared to javascript. ​


> 1) Build a parallel array with the index values.

​And store this in the c-obj?

2) Another idea: Skip the hassle and space of storing a secondary array and
> just load up the object array and iterate over it.
>
​What I do pretty much all the time - though this is giving me some
different options.


> 3) The third idea from Rob and Justin is to use the unique value as a key
> name. This is a pretty awesome idea for 4D, as it turns out. Unfortunately,
> it's not an obvious one, given the names of the commands and it makes some
> specific cases of processing the JSON a bit more difficult.

​I took a stab at using this approach for little while. I build a big c-obj
with all the related data to sales orders. "Related data" being the various
permutations of line items and other things. I ultimately rejected it
because while it was good for 4D it made things more difficult outside of
4D. My ultimate aim was to have a description of these records I could
reference in 4D and in javascript.

With a binary search, the number of required comparisons is *massively*

> reduced.
> ​...​
>
> Size                 Max comparisons
>                             2    1
>                             4    2
>                             8    3
>                           16    4
>                      1,000   10
>                      2,000   11
>                      4,000   12
>               1,000,000   20
>        1,000,000,000   30
> 1,000,000,000,000   40
>
> What you have to say "scales well." If you've got a situation where you can
> benefit from binary search, why wouldn't you use it?
>
​You would. I will going forward. Part of what I'm trying to get a feel for
is if any of this offers significant improvement on things I'm doing a lot
of right now. ​


> Don't get me wrong! I'm not about micro-optimizations and I'm not about
> going to extra work to optimize "just in case." Micro-optimization and
> optimize-in-advance have gotten me into *way* more trouble down the years
> than not optimizing something and having to go back. But when you get to
> pick a design and one approach is broadly superior without costing extra
> work, why not? Just seems sensible.
>
​Agreed again.


> This is why, for example, I've long been unable
> ​
> to reproduce virtually any of the "optimizations" I've heard about 4D. They
> ​ ​
> only work in edge cases. LR and the team are doing a great job of
> ​ ​
> optimizing server operations and I don't see any reason to second guess
> ​ ​
> them.

​I agree. It took me a while to come around to this idea. That's why I tend
to look for solutions that let 4D do what it's built to do.

​What I'm seeing is these are techniques most applicable to really large
data objects. I'm most frequently dealing with object arrays of less than
50 elements. For things this small I can rely on the straight forward 4D
approach. But as I write that I can also see the benefits of a little more
structure in constructing my objects - even if they aren't on a scale where
using Find in array vs. Find in sorted array will make a measurable
difference.

--
Kirk Brooks
San Francisco, CA
=======================
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************
Reply | Threaded
Open this post in threaded view
|

Re: Tip: Binary insertion code

David Adams-4
> ​What I'm seeing is these are techniques most applicable to really large
> data objects. I'm most frequently dealing with object arrays of less than
> 50 elements. For things this small I can rely on the straight forward 4D
> approach. But as I write that I can also see the benefits of a little more
> structure in constructing my objects - even if they aren't on a scale
where
> using Find in array vs. Find in sorted array will make a measurable
> difference.

In everyday use, I'd expect that the most common payoff isn't from using a
sorted/unsorted array as either will be fast. The big payoff is when you
can use a small data type array (longint for example) stored either inside
or outside of the larger object to avoid iterating over lots of big
objects. If the index array is outside of the object (at least as long as
you are inside 4D), you could have 1,000 elements in a tiny array, check it
quickly *before* you have to load the object, reconstitute the component
objects to get the one you need. That could be helpful in a pretty broad
range of situations that ordinary people run into. Problems at scale - like
500K items exist, but they're a lot less common. Still, good to know the
range of options.
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[hidden email]
**********************************************************************