Memory allocation, release to OS - Fragmentation - Python - Elixir - Tcl
This article is about the difference between developers, script writers and true programmers (who should really be called software architects, alongside software engineers).
True PROgrammers understand their environment and how everything works, from the low level to the high level.
That way they know when it's futile to attempt something and when a workaround is needed. They are the ones who listen to your problem and already have a solution in mind before they've written a single line of code. They've tested this principle enough times to know what works and what doesn't.
Strong software engineers are not afraid to test it again, and to change course if the results look grim.
Real programmers will say: thank you, reverse engineering, for teaching me how things actually work!
"People find thinking to be difficult because civilization has never made any attempt to make thinking simpler." Edward de Bono
Somewhere around November 2017 I saw a programmer offering a sum of money to any experienced developer who could create a long-running script in Python. One that could run for months.
The script had to allocate a list of 1000 integers and floats, deallocate the data, and not keep eating memory. All of this on a Raspberry Pi running Linux.
NOTE: I am not an expert when it comes to Python. The only reason I learned Python was to use the great penetration testing tools written in it, together with the power Blender 3D offers through its Python API.
My main choices for comparison are always C, Tcl and Elixir. Since discovering Elixir I only use the Port module to access certain functionality not provided by Elixir or Erlang itself.
That works great with Python for data science. None of this stopped me from thinking about the offer, though.
Of course, just reading the offer instantly made me realize that the person in question was struggling with the script not releasing memory back to the operating system.
This really happens, even after you have freed the memory and even after the garbage collector has done its job.
Developers and programmers began offering solutions instantly, without understanding what the person actually needed. The offer started out at 1000 EUR and ended at 5000 EUR.
I just posted a comment explaining the futility of the project: no solution would hold up without modifications to the underlying system. I asked what his problem really was, and suggested that the most viable solution would be a supervisor model with subprocesses.
In the end no one got the 5000 EUR prize. Yet I had seen a certain blindness, caused by a sum of money offered for a seemingly simple thing.
How could those programmers be so ignorant of how the underlying system works? Even though I'm not an expert in Python, it wasn't hard for me to understand the problem he had, even though he didn't state it explicitly. What did he actually mean?
Can you read? Can you Think?
This reminded me of the one time I was a project manager for a project on various freelance websites.
I took on the project management because I was too busy with other projects to do the programming myself; the client needed it fast and wanted me to control the quality.
After I had analyzed the problem and written a detailed two-page explanation of what needed to be done and WHAT was to be avoided, I posted it so others could bid.
What happened in the following days made me change my perception and understand that in order to be a true PROgrammer you need knowledge of the whole underlying system.
To my surprise NONE of the initial people who posted bids even read the requirements.
They just posted bids without even understanding the problem and the environment.
For me that was flabbergasting!
In all my time as a software PROgrammer (or better put, engineer) I've approached this in what seems to be a unique way!
I always analyze the requirements first. I run a quick simulation in my head of how a solution would look from a technical standpoint and how feasible it is.
Then I go deeper, asking questions until I have a firm grip on what is actually needed.
After that I analyze everything, do some research and, depending on the project, try to build a proof of concept, prototype or MVP in under a day.
If I can't build the MVP in 2-3 days, then either the project has a problem, or the customer needs something else or has a different problem entirely.
Then I think about the implications and search for potential solutions, and only when I'm done with my research do I give an answer and an offer.
Nope, those people just said they could do it, just like that. A clap of the hands. How on earth did they know they could do it so fast? They never even found out what the user needed!
I started asking questions about how they thought they would implement it.
Funnily enough, their responses included the very aspects the requirements marked as TO BE AVOIDED. It's as if they hadn't even bothered to read the description.
Fellow users, programmers and managers: this is how things really work, even with the thousands of meetings you all hold each day. This is the end result.
Two disputes later we found our match: someone who had already done something similar in the past.
Understanding the problem and environment
First we need to understand how the memory model works.
Back to our Python script, which should be a long-running script that doesn't eat up memory.
Whenever a new process is created, the operating system and underlying kernel use something called virtual memory mapping.
This allows EACH process to have up to 4 GB of address space on 32-bit systems and far more on 64-bit systems.
The OS/kernel handles the mapping from virtual to physical memory.
Understand that it has to do this mapping for ALL running processes, which over time leads to FRAGMENTATION of memory. C and C++ suffer from this, so it's only normal that dynamic languages suffer from it too.
https://www.design-reuse.com/articles/25090/dynamic-memory-allocation-fragmentation-c.html
Facebook encountered a similar issue in the past; I'll talk about that later.
On top of this, the operating system has a few tricks up its sleeve to protect memory from attackers. These mitigations were added over the past 20 years, and they can lead to even more fragmentation.
ASLR - Address Space Layout Randomization - randomizes the address space layout so that the same program, run multiple times, does not end up with the same memory locations.
DEP/NX - the No-eXecute bit, which separates memory regions used for data from those used for execution, making sure certain regions are NOT executable. This blocks classic injected-shellcode attacks (and is the reason attackers moved on to techniques like ROP).
Stack canaries - the compiler modifies the prologue and epilogue of functions to place and check a guard value, making stack buffer overflow attacks harder.
Various other protections such as FORTIFY_SOURCE, RELRO and PIE.
Should a programmer know about these even if he or she is not into reverse engineering or cyber security? Yes; in practice it will help them become a better PROgrammer.
All these protections make it much harder for bad code to do bad things, but at the same time they make the allocation and deallocation of memory a more complicated process.
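If you want to see ASLR with your own eyes, here is a tiny sketch of my own (not part of the original discussion). Run it twice: on a system with ASLR enabled, the printed address changes between runs because the heap base is randomized.

# aslr_demo.py - run twice; with ASLR enabled the address differs on each run
import ctypes

buf = ctypes.create_string_buffer(64)           # a small heap-backed buffer
print("buffer address:", hex(ctypes.addressof(buf)))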
Whenever we add things to the heap via pointers (strings, data objects), a reference is created to the actual data, and both the data and the reference occupy memory.
In a multitasking environment this happens for all running processes and threads.
Whenever we push 1-2 GB of random data into a process's memory in an unaligned way, the process's memory usage grows accordingly.
When we then free the memory, delete everything and even let the GC do its work, in certain environments we are still left with a huge portion of memory that is NOT free.
WHY?
It's because of memory fragmentation, which takes place in the underlying memory.
Even after you've freed your memory, that 1 GB of data will probably have been scattered all across the heap, and unless you restart your app there is often no way to release it back to the OS.
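To make the "data plus reference" point concrete, here is a small sketch of my own using only the standard library: the list itself stores little more than pointers, while the string objects it refers to are separate allocations scattered across the heap, which is exactly what leaves holes behind once they are freed.

import sys

data = ["x" * 1024 for _ in range(10)]         # ten ~1 KB string objects, each a separate heap allocation
print(sys.getsizeof(data))                     # size of the list object itself: essentially 10 pointers plus overhead
print(sum(sys.getsizeof(s) for s in data))     # combined size of the string objects those pointers reference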
For C and C++ there is a partial solution: it's called jemalloc.
Certain operating systems and software packages have had this problem too.
Facebook started using a different memory allocator, jemalloc, precisely to address this issue:
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/
Examples Please
Let me illustrate this with a few Python examples that you can run to test it out yourself.
I'm using Debian 9 with Python 2.7.13 and Python 3.5.4. The results may vary if you use different versions so beware.
I know from previous experience that scripting languages tend to have this issue. I've had the same problem in Tcl in the past; the only solution was a restart.
First of all, Python 2.7 has big issues on this point, while 3.5 has fewer. But we will see how even 3.5 succumbs when random data is involved.
A simple example (note: it will take up to 1.7 GB on your PC!). I've run tests that used up to 20 GB of RAM; if you have the RAM, please try the more memory-intensive variants and see for yourself.
List = range(50 * 1024 * 1024)   # on Python 3, use list(range(50 * 1024 * 1024)) to actually build the list
del List
On Python 2.7 this process grows to around 1.6 GB. After del List, it only goes back down to about 1.2 GB.
But why is there such a gap? The allocations and pointers are gone, but the fragmentation remains, as mentioned earlier. This is the problem our developer had in the beginning.
Testing the same thing on Python 3.5, the memory is used, but after del we go back to 9-10 MB (keep in mind that on Python 3 a bare range allocates almost nothing, so use list(range(...)) to make the comparison fair).
Does this mean that 3.5 solved the problem? In some cases, yes. But with random data the problem remains.
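If you want to watch these numbers yourself instead of eyeballing top, here is a minimal helper I would use, assuming Linux (it reads VmRSS from /proc/self/status); it is my addition, not part of the original experiment.

# rss.py - print the resident set size of the current process (Linux only)
def rss_mb():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0   # VmRSS is reported in kB
    return 0.0

if __name__ == "__main__":
    data = list(range(50 * 1024 * 1024))
    print("after allocation: %.1f MB" % rss_mb())
    del data
    print("after del:        %.1f MB" % rss_mb())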
In the following example we will use Python 3.5.
We will create 100,000 strings of 1 KB (1024 bytes) of random characters, each repeated a random number of times (10 to 50), alongside a list of random integers.
# NOTE: This can use up to a few GB of RAM.
import random
import string
import time
import gc

gc.enable()
# time.sleep(60)

def random_string(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

numbers = []
strings = []  # renamed from "string" so we don't shadow the string module

for x in range(0, 100000):
    if x % 20000 == 0:
        print("We're at ", (x / 1000), " %")
    numbers.append(x * random.randrange(100, 10000) + x)
    strings.append(random_string(1024) * random.randrange(10, 50))
    # strings.append("Fragmentation of memory together with memory alignment and "
    #                "virtual memory mapping will for sure screw up stuff if you use "
    #                "random data and not something Python can control, like pointers "
    #                "to the same data fragment" * random.randrange(1, 10))

print("Great, we're done!")

# NOTE: A sleep of 1 minute makes it evident
time.sleep(5)

del numbers
del strings
print("Deleted everything!")

# while True:
#     0

gc.enable()
gc.collect()
By running this example you will see memory climb to a few GB and then, when we free everything, only a small decrease.
Even though the data no longer exists after being freed, the fragmentation has taken its toll. (If we analyzed the memory, the old data would probably still be there, since nothing has overwritten it; there simply is no reference to it anymore.)
So the memory is NOT released back to the operating system?
Running this simple example brings us to a very important question.
Why do you think that web servers start new threads/processes for every request?
Or, in more modern setups, why they keep a pool of processes and threads that are restarted once a certain number of connections has gone through (1,000-10,000 connections)? A sketch of such a configuration follows below.
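As one example of that recycling pattern (my choice of server, not something from the original discussion), Gunicorn can be told to restart each worker after a given number of requests through its Python configuration file:

# gunicorn.conf.py - recycle workers so fragmentation can't accumulate forever
workers = 4                  # pool of worker processes
max_requests = 1000          # restart a worker after it has served this many requests
max_requests_jitter = 100    # add randomness so workers don't all restart at once

A fresh worker starts with a clean heap, so whatever fragmentation the old one accumulated disappears with it.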
We can observe that memory fragmentation caused by allocation and deallocation is one of the biggest issues in long-running systems.
Even if you optimize and review everything, you will almost certainly still need a process restart at some point.
I've even forced garbage collection and it still failed.
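One more knob worth mentioning before the real solutions: on glibc-based Linux systems you can ask the C allocator to hand free pages back to the kernel with malloc_trim. It sometimes helps and sometimes does nothing, and it is my addition here rather than something the original poster tried:

import ctypes
import gc

gc.collect()                              # let Python free what it can first
libc = ctypes.CDLL("libc.so.6")           # glibc only; other libcs don't export malloc_trim
released = libc.malloc_trim(0)            # returns 1 if some memory was released to the OS
print("malloc_trim released memory:", bool(released))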
What solutions are there?
We certainly have multiple solutions for this problem.
Depending on the programming language and the context we can do one of several things.
One of the best approaches is to have a supervisor that spawns a new process to run the memory-intensive work.
Whenever we reach a certain limit, or a certain amount of time has passed, we spawn a new process, move the state over through IPC (inter-process communication) and close the old process.
This way we can be sure the system keeps running and we don't lose the state. A minimal sketch of the pattern follows.
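Here is a minimal sketch of that idea in Python, using the standard multiprocessing module (my illustration of the pattern, not the original poster's code): the parent keeps the state, a short-lived child does the memory-hungry work, and the child is thrown away, heap and all, after every batch.

# supervisor.py - restartable worker pattern: the child's fragmented heap dies with the child
import multiprocessing as mp
import random

def worker(state, result_queue):
    # Memory-hungry work happens here, in a short-lived process.
    junk = ["x" * random.randrange(512, 2048) for _ in range(100000)]
    result_queue.put(state + len(junk))     # send the updated state back over IPC

def supervise(batches=5):
    state = 0
    for _ in range(batches):
        q = mp.Queue()
        p = mp.Process(target=worker, args=(state, q))
        p.start()
        state = q.get()                     # receive the state before the child exits
        p.join()                            # child exits; the OS reclaims ALL of its memory
    return state

if __name__ == "__main__":
    print("final state:", supervise())

You could just as well trigger the restart based on the child's RSS instead of a fixed batch count.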
If this is too difficult to implement, you can simply restart the process, accepting that you lose some state and some uptime.
Functional Versus Imperative
Scaling this solution in imperative programming languages (Tcl, C, Python, Perl, Java) proves to be surprisingly difficult, even for the smallest tasks.
You will spend a lot of time writing plumbing code to move and protect state, because the imperative paradigm gives you none of this for free.
I've personally found that functional languages like Elixir and Erlang solve this quickly and easily thanks to their lightweight-process concurrency model.
The Supervisor behaviour in Elixir makes the spawning and restarting almost trivial.
It's exactly what happens whenever a process crashes, and it is one of the great hallmarks of the Erlang BEAM!
Switching to a functional programming language means learning new patterns of thinking, but it is worth your time: you gain the power of concurrency and fault tolerance.
Tcl version
I did some similar tests with Tcl.
There is no automatic garbage collector available (Tcl relies on reference counting), so we're stuck with a big runtime footprint unless we explicitly free everything ourselves.
Using Tcl's C API means we need to be watchful of every allocation.
A Tcl-only version grows to 2.6 GB and, after the data is released, we're left with a 249 MB process... still a lot, considering that a fresh Tcl process needs under 10 MB of RAM.
proc rnd {min max} {
    expr {int(($max - $min + 1) * rand()) + $min}
}

proc generateCode {length} {
    set string "azertyuiopqsdfghjklmwxcvbnAZERTYUIOPQSDFGHJKLMWXCVBN0123456789"
    set code ""
    set stringlength [expr {[string length $string] - 1}]
    for {set i 0} {$i < $length} {incr i} {
        append code [string index $string [rnd 0 $stringlength]]
    }
    return $code
}

set numbers(0) 0
set max 1000000
for {set i 1} {$i < $max} {incr i} {
    if {$i % ($max / 100) == 0} {
        puts "[expr {$i / ($max / 100)}] percent complete"
    }
    # set data($i)
    append data [string repeat "Intriguing I might add" [rnd 64 128]]
    set numbers($i) [expr {$i * $i + $i + [rnd 1000 1000000]}]
}
puts "OK done.."

after 5000
unset data
array unset numbers
puts "Deleted data"
C Version
A C version is futile, since we already know what free does: it returns memory to the allocator, not necessarily to the OS. I'll leave that experiment to you; run it and see what happens. :)
Elixir
Let's see how Elixir and the BEAM handle things, shall we?
do_it = fn ->
  String.duplicate("Certainly, fascinating my dear fellow!", 1024 * 1024 * 512)
end

do_it.()
My system reports 19 GB used by BEAM.
It does not go away, even though the string is left unused.
Let's see if the garbage collector can do something.
:erlang.garbage_collect(self())
:erlang.system_flag(:fullsweep_after, 10)
Let's start the built-in Observer to see what's happening:
:observer.start()
It seems there are 19 GB of binaries.
Now let's do it the Elixir way, just to show off:
defmodule HeavyLifting do
  def heavy_stuff() do
    max = 1024 * 1024

    Enum.map(1..max, fn _x -> {:crypto.strong_rand_bytes(1024), :rand.uniform(100000)} end)
    |> Enum.max()

    {:ok, "Cool"}
  end
end

HeavyLifting.heavy_stuff()
The BEAM grows to over a gigabyte (about 1.4 GB in this example).
Sometimes it decreases automatically; other times we need to run the garbage collector ourselves, and then it instantly drops and sits at around 90 MB. Given that a clean iex session uses around 40 MB, I'd say this is a win, since we don't have hundreds of MB lingering around. Remember that we generated REAL cryptographically random bytes, which couldn't be shared or reused!
We can force the garbage collector and memory drops instantly: :erlang.garbage_collect()
Of course using a functional programming language won't solve ALL your problems.
Erlang (and Elixir inherits this) has some issues with certain binaries, which don't seem to go away as fast as you would like.
Conclusion
Bad programming styles eat up memory just as much as random data does.
You can also try a version where we have a static string which is repeated multiple times.
Even after releasing it, in certain cases we see that 600 MB still remain.
Understanding what Linux does with caching and data:
https://www.linuxatemyram.com/
Details and more info
Wikipedia
Google
https://security.stackexchange.com/questions/20497/stack-overflows-defeating-canaries-aslr-dep-nx
Cover image by Markus Spiske from Pixino.com