n Garbage Collection

Enums

In the Enums example below, days_of_week_t is a new type that can only have one of the values defined in the enum:

The typedef and its alias days_of_week_t are optional, but like with structs, they make the enum easier to use.


	    typedef enum DaysOfWeek {
	      MONDAY,
	      TACO_TUESDAY,
	      WEDNESDAY,
	      THURSDAY,
	      FRIDAY,
	      SATURDAY,
	      FUNDAY,
	    } days_of_week_t;
	  

An enum is not a collection type like a struct or an array. It's just a list of integers constrained to a new type, where each is given an explicit name.

  • MONDAY, which is 0
  • TACO_TUESDAY, which is 1
  • WEDNESDAY, which is 2
  • THURSDAY, which is 3
  • FRIDAY, which is 4
  • SATURDAY, which is 5
  • FUNDAY, which is 6

You can use the enum type like this

	    
	      typedef struct Event {
	      char *title;
	      days_of_week_t day;
	      } event_t;

	      // Or if you don't want to use the alias:

	      typedef struct Event {
	      char *title;
	      enum DaysOfWeek day;
	      } event_t;
	    
	  

Sometimes, you don't just want to enumerate some names (where the underlying integer constant values don't really matter)... you want to set those enumerations to specific values. For example, you might want to define a program's exit status codes:

	    
		typedef enum {
		  EXIT_SUCCESS = 0,
		  EXIT_FAILURE = 1,
		  EXIT_COMMAND_NOT_FOUND = 127,
		} ExitStatus;
	    
	  

Alternatively, you can define the first value and let the compiler fill in the rest (incrementing by 1):

	  
	     typedef enum {
	       LANE_WPM = 200,
	       PRIME_WPM, // 201
	       TEEJ_WPM,  // 202
	     } WordsPerMinute;
	  
	

One of the best features of enums is that it can be used in switch statements. Enums + switch statements:

  • Avoid "magic numbers"
  • Use descriptive names
  • With modern tooling, will give you an error/warning that you haven't handled all the cases in your switch

Enums are often used to represent the possibilities in a set. For example:

  • SMALL = 0
  • MEDIUM = 1
  • LARGE = 2
  • EXTRA_LARGE = 3

Your code probably cares a lot about which size a variable rbbepresents, but it probably doesn't care that SMALL happens to be 0 under the hood. From the compiler's perspective, enums are just fancy integers.

Unions

Unions in C can hold one of several types.

The age_or_name_t type can hold either an int or a char *, but not both at the same time (that would be a struct). We provide the list of possible types so that the C compiler knows the maximum potential memory requirement, and can account for that.

This is how the union is used:

	    
	  age_or_name_t lane = { .age = 29 };
	  printf("age: %d\n", lane.age);
	  // age: 29
	    
	  

Here's where it gets interesting. What happens if we try to access the name field (even though we set the age field)?

printf("name: %s\n", lane.name);

We get... nothing? To be more specific, we get undefined behavior. A union only reserves enough space to hold the largest type in the union and then all of the fields use the same memory. So when we set .age to 29, we are writing the integer representation of 29 to the memory of the lane union:

0000 0000 0000 0000 0000 0000 0001 1101

Then if we try to access .name, we read from the same block of memory but try to interpret the bytes as a char *, which is why we get garbage (which is interpreted as nothing in this case). Put simply, setting the value of .ageaweflkdfsa overwrites the value of .name and vice versa, and you should only access the field that you set.

Unions store their value in the same memory location, no matter which field or type is actively being used. That means that accessing any field apart from the one you set is generally a bad idea.

A downside of unions is that the size of the union is the size of the largest field in the union. Take this example:


typedef union IntOrErrMessage {
  int data;
  char err[256];
} int_or_err_message_t;
	  

This IntOrErrMessage union is designed to hold an int 99% of the time. However, when the program encounters an error, instead of storing an integer here, it will store an error message. The trouble is that it's incredibly inefficient because it allocates 256 bytes for every int that it stores!

Imagine an array of 1000 int_or_err_message_t objects. Even if none of them make use of the .err field, the array will take up 256 * 1000 = 256,000 bytes of memory! An array of ints would have only taken 4,000 bytes (assuming 32-bit integers).

One interesting (albeit not commonly used) trick is to use unions to create "helpers" for accessing different parts of a piece of memory. Consider the following:

Only 4 bytes are used. And, unlike in 99% of scenarios, it makes sense to both set and get values from this union through both the components and rgba fields! Both fields in the union are exactly 32 bits in size, which means that we can "safely" (?) access the entire set of colors through the .rgba field, or get a single color component through the .components field. The convenience of additional fields, with the efficiency of a single memory location!

stack and heap

Allocating memory on the stack is preferred when possible because the stack is faster and simpler than the heap.

  • Efficient Pointer Management: Stack "allocation" is just a quick increment or decrement of the stack pointer, which is extremely fast. Heap allocations require more complex bookkeeping.
  • Cache-Friendly Memory Access: Stack memory is stored in a contiguous block, enhancing cache performance due to spatial locality.
  • Automatic Memory Management: Stack memory is managed automatically as functions are called and as they return.
  • Inherent Thread Safety: Each thread has its own stack. Heap allocations require synchronization mechanisms when used concurrently, potentially introducing overhead.

So the stack is great and all, but one of the downsides is that it has a limited size. If you keep pushing frames onto the stack without popping them off, you'll eventually run out of memory and get a stack overflow. (yes, that's what the famous site is named after)

That's one of the reasons recursion without tail-call optimization can be dangerous. Each recursive call pushes a new frame onto the stack, and if you have too many recursive calls, you'll run out of stack space.

So we know that stack frames are always getting pushed and popped, and as a result, memory addresses on the stack are always changing and getting reused.

"The heap", as opposed to "the stack", is a pool of long-lived memory shared across the entire program. Stack memory is automatically allocated and deallocated as functions are called and return, but heap memory is allocated and deallocated as-needed, independent of the burdensome shackles of function calls.

When you need to store data that outlives the function that created it, you'll send it to the heap. The heap is called "dynamic memory" because it's allocated and deallocated as needed. Take a look at new_int_array:

Because the size of the array isn't known at compile time, we can't put it on the stack. Instead, we allocate memory on the heap using the stdlib.h malloc function. It takes a number of bytes to allocate as an argument (size * sizeof(int)) and returns a pointer to the allocated memory (a void * that we cast to an int*). Here's a diagram of what happened in memory:

The new_int_array function's size argument is just an integer, it's pushed onto the stack. Assuming size is 6, when malloc is called we're given enough memory to store 6 integers on the heap, and we're given the address of the start of that newly allocated memory. We store it in a new local variable called new_arr. The address is stored on the stack, but the data it points to is in the heap.

Let's look at some code that uses new_int_array:

When we're done with the memory, we need to manually deallocate it using the stdlib.h's free function:

The free function returns (deallocates) that memory for use elsewhere. It's important to note that the pointer (arr_of_6) still exists, but shouldn't be used. It's a "dangling pointer", pointing to deallocated memory.

The malloc function (memory allocation) is a standard library function in C that allocates a specified number of bytes of memory on the heap and returns a pointer to the allocated memory.

This new memory is uninitialized, which means:

  • It contains whatever data was previously at that location.
  • It is the programmer's responsibility to ensure that the allocated memory is properly initialized and eventually freed using free to avoid memory leaks.

If you want to make sure that the memory is properly initialized, you can use the calloc function, which allocates the specified number of bytes of memory on the heap and returns a pointer to the allocated memory. This memory is initialized to zero (meaning it contains all zeroes).

malloc function signature

void* malloc(size_t size);
  • size: The number of bytes to allocate.
  • Returns: A pointer to the allocated memory or NULL if the allocation fails

This idea of manually calling malloc and free is what puts the "manual" in "manually managing memory":

The programmer must remember to eventually free the allocated memory using free(ptr) to avoid memory leaks. Otherwise, that allocated memory is never returned to the operating system for use by other programs. (Until the program exits, at which point the operating system will clean up after it, but that's not ideal.)

Manually managing memory can be error-prone and tedious, but languages that automatically manage memory (like Python, Java, and C#) have their own trade-offs, usually in terms of performance.

The free function deallocates memory that was previously allocated by malloc, calloc, or realloc.

IMPORTANT: free does not change the value stored in the memory, and it doesn't even change the address stored in the pointer. Instead, it simply informs the Operating System that the memory can be used again.

Forgetting to call free leads to a memory leak. This means that the allocated memory remains occupied and cannot be reused, even though the program no longer needs it. Over time, if a program continues to allocate memory without freeing it, the program may run out of memory and crash.

Memory leaks are one of the most common bugs in C programs, and they can be difficult to track down because the memory is still allocated and accessible, even though it is no longer needed.

Pointer pointers allow you to create complex data structures like arrays of pointers, and to modify pointers indirectly. The syntax is exactly what you would expect:


int value;
int *pointer;
int **pointer_pointer;
	  

Pointers to pointers (or pointers to pointers to pointers to pointers... you get the idea) are like a treasure map or a scavenger hunt. You start at one pointer and keep following the chain of addresses until you get to the final value. It's just a chain of dereferences.

Making an array of integers on the heap is pretty simple:


int *int_array = malloc(sizeof(int) * 3);
int_array[0] = 1;
int_array[1] = 2;
int_array[2] = 3;
	  

But we can also make an array of pointers! It's quite common to do this in C, especially considering that strings are just pointers to chars:


char **string_array = malloc(sizeof(char *) * 3);
string_array[0] = "foo";
string_array[1] = "bar";
string_array[2] = "baz";
 

We've already discussed void, which essentially means "nothing" in C. It's used in a few different contexts:

  • void update_soldier(soldier_t *s): means the function returns nothing
  • soldier_t new_soldier(void): means the function takes no arguments.

And, because C likes re-using ideas but with slightly different meanings (the genius of the design can't be understood by us mere mortals) void also has another use!

A void * "void pointer" tells the compiler that this pointer could point to anything. This is why void pointers are also known as a "generic pointer". Since void pointers do not have a specific data type, they cannot be directly dereferenced or used in pointer arithmetic without casting them to another pointer type first.

Casting to and from void pointers in C is unique because void pointers are type-agnostic. When casting a specific type pointer to a void pointer, no type information is retained, allowing the void pointer to point to any data type. However, you must cast a void pointer back to its original type before dereferencing it, as direct dereferencing is not possible.


int number = 42;
void *generic_ptr = &number;

// This doesn't work
printf("Value of number: %d\n", *generic_ptr);

// This works: Cast to appropriate type before dereferencing
printf("Value of number: %d\n", *(int*)generic_ptr);
	  

A common pattern is to store generic data in one variable, and the type of that data in another variable. This is useful when you need to pass data around without knowing its type at compile time.

Happy Thanksgiving!!

GPA pattern warnings

when you follow the previous pattern you end up with code like this, where every single object in your program, that can't be statically allocated, has pairs of functions for alloc and release.

This leads to this pattern being used for very granular “objects”, which explodes the number of actual dynamic allocations and deallocations, potentially having to track 1000's of lifetimes, that may be interdependenat on one aonther. This makes it far more likely that a programming mistake occurs.

the programmer is responable for keeping track of malloc's and free's. Each object has it's OWN LIFETIME that the programmer must manage. Mismanagment will lead to "memory leaks" and "dangling pointers", and potential runtime failure.

other strategies

Other strategies try to group objects with similiar lifetimes together, with the ability to free all there memory with a single instruction. Which makes mgmt less error prone, but also does not bear the overhead incurred by using a garbage collector

A memory leak is when your program continually calls malloc and never free's the allocated memory. This eventually leads to your program eventually crashing, when it uses up it's entire address space.

A dangling pointer, is a pointer that is pointing at an address that has been free's already. So any operations using that pointer lead to some very unexpected results.

stable and fast allocation

We can write programs that do not encounter failure cases, by setting hard limits on the amount of memory we are going to use for each subsystem and grouping objects of similiar lifetime together.

  • bump allocator
  • free list allocator
  • arena allocator

Cache, cpu memory diagram