Python question - first call is slower?

Thu Jun 21 14:48:48 IDT 2012

On Wed, Jun 20, 2012, Nadav Har'El wrote about "Python question - first call is slower?":
> Hi, I have run across a puzzling issue in Python, and I wonder if anyone
> on the list can explain it.
>...
> But strangely, it doesn't - the first call takes 14 milliseconds, while
> the second and all subsequent calls take only 2 milliseconds each.
> Does anybody have any idea why this can happen?

Thanks to everyone for all the ideas and suggestions. While trying to
simplify the example which showed this problem, I started to think that
this is perhaps not a general Python problem, but rather a problem
caused by one specific thing that the example program did which I
previously neglected to mention:

My example program actually *forked* right before that first call.

It turns out that (at least, this is my understanding now) memory allocation -
in Python but also in C - is significantly slower during this first call, but
returns to normal speed during the second call. Because Python relies on
a lot of memory allocations, the first call becomes significantly slower
than the second.

But why is allocation slower during the first run of that function?

The short explanation is copy-on-write page faults.
The long explanation:

Python's initialization before our fork() allocates and frees a lot of
memory, leaving a fragmented heap full of holes that can be reused by
future allocations.
Now, think of what happens after a fork(). The child gets a copy of this
fragmented heap; But this is not a real copy, but rather a copy-on-write -
pages are only copied when the child writes them. What happens now when the
child allocates memory? Since the heap is full of holes, one of these
holes will be returned, but when the child writes in it, it causes the
full page containing the hole to be copied - which is slow.

I include below an example program (in C) to demonstrate the problem.
It starts by creating a "swiss cheese" heap (100 MB allocated and then
half the memory is freed creating a lot of 1000 byte holes), and
runs a measure() function which allocates memory from these holes.
Before the fork, measure() takes 15 milliseconds each time (the first
and subsequent). After the fork, the first measure() takes 90
millisecond, and all the subsequent calls again take 15 milliseconds.

In theory, this problem can be elliminated if we could "compact" the
heap before calling fork(), or find a way to cause memory allocations
in the child to return memory from a new parts of the heap - not holes
in the parent's heap. Somewhat counter-intuitively, that second approach
will also save memory.

/* Try to show that memory allocation after fork is slower because of
   copy-on-write and previously freed memory - and this can explain
   the slowdown we see after the fork.
 */
#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000
#define M 1000
static measure(){
	/* See how much time it takes to allocate (and then free) N/2
	   chunks. Presumably, these will be returned from the holes left
	   in the heap by our previously freeing half of the N chunks
	   we allocated initially.
	*/
	int i;
	struct timeval old_time, new_time;
	long usec;
	void **p = calloc(N/2+1, sizeof(void *));
	void **pp = p;
	gettimeofday(&old_time, NULL);
	for(i=0; i<N; i+=2){
		*(pp++)=malloc(M);
	}
	pp=p;
	for(i=0; i<N; i+=2){
		free(*(pp++));
	}

        gettimeofday(&new_time, NULL);
        usec = new_time.tv_usec - old_time.tv_usec +
                1000000*(new_time.tv_sec - old_time.tv_sec);
	fprintf(stderr, "measure() - alloc and freeing N/2 chunks: %d usec\n", usec);

}

int main(int argc, char *argv[]){
	int i;
	struct timeval old_time, new_time;
	long usec;

	fprintf(stderr,"Preparing swiss-cheese heap:\n");
	fprintf(stderr,"  N=%d (chunks), M=%d (chunk size)\n",N,M);
	void **p = calloc(N, sizeof(void *));

	/* Allocate N chunks of M bytes */
	gettimeofday(&old_time, NULL);
	for(i=0; i<N; i++){
		p[i]=malloc(M);
	}
        gettimeofday(&new_time, NULL);
        usec = new_time.tv_usec - old_time.tv_usec +
                1000000*(new_time.tv_sec - old_time.tv_sec);
	fprintf(stderr, "  allocating N chunks: %d usec\n", usec);

	/* Free half the chunks - every second chunk, so that we have a lot
	   of free space in the heap, but no page can actually be fully
	   returned to the system.
	*/
	gettimeofday(&old_time, NULL);
	for(i=0; i<N; i+=2){
		free(p[i]);
	}
        gettimeofday(&new_time, NULL);
        usec = new_time.tv_usec - old_time.tv_usec +
                1000000*(new_time.tv_sec - old_time.tv_sec);
	fprintf(stderr, "  freeing N/2 chunks: %d usec\n", usec);

	fprintf(stderr, "\nIn original process. Note first run is just as fast as the rest:\n");
	measure();
	measure();
	measure();
	measure();

	if(fork() != 0){
		/* parent */
		wait(NULL);
	} else {
		/* child */
		fprintf(stderr, "\nIn child. Note how slow the first run in the child is!\n");
		measure();
		measure();
		measure();
		measure();
	}
}

-- 
Nadav Har'El                        |                  Thursday, Jun 21 2012, 
nyh at math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The person who knows how to laugh at
http://nadav.harel.org.il           |himself will never cease to be amused.