Some programming language ideas
This post is me writing down some ideas for programming languages or programming language features. Mainly to get them out of my head into the world without having to write an actual compiler for them 😅 I start by providing some context and then present my ideas. Hopefully it’s interesting.
Explanation of existing concepts
Here are a few short explanations and references to existing concepts that I refer to later on. The explanations are meant to give you some intuition about them if you are not at all familiar with them.
Refinement types
A refinement type is a type with a predicate attached to it that holds for every value of that type.
Rust is not a language that has user-definable refinement types, but you can find some examples of refinements in the standard library, like NonZero and NonNull.
The documentation for NonZero mentions that it can result in memory optimizations.
For example, Option<NonZero<u32>> is the same size as u32 because the compiler is able to use 0 to mean None.
This is great! However, I think it is a shame that it is not possible to describe such optimizations ourselves.
It is not uncommon to have integer variables where not every value is valid.
The excess values could be used for something else, but you cannot express this.
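For reference, the niche optimization can be observed with a few assertions. Here is a small Rust snippet, using nothing beyond the standard library:

use std::mem::size_of;
use std::num::NonZero;

fn main() {
    // 0 is never a valid NonZero<u32>, so the compiler can use it to encode None.
    assert_eq!(size_of::<Option<NonZero<u32>>>(), size_of::<u32>());
    // Without such a niche the Option needs extra space for its tag.
    assert!(size_of::<Option<u32>>() > size_of::<u32>());
}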
Strict compile time safety
Wuffs is a memory-safe programming language that does have refinement types. It is a small language, meant for writing decoders and encoders. It is safe with respect to buffer overflows, integer arithmetic overflows and null pointer dereferences, and all of these checks are done at compile time. For an integer variable the compiler understands which values it can hold, and therefore whether an addition might result in an overflow or not.
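To get a feel for this, here is a small Rust sketch (my own illustration, not Wuffs code) of the kind of reasoning such a compiler performs: once a bound on the operands is established, the addition below cannot overflow, and a refinement-typed compiler can accept it without any runtime check.

fn add_bounded(x: u8, y: u8) -> u8 {
    // After this check x <= 127 and y <= 127, so x + y <= 254, which fits in a u8.
    // A compiler that tracks value ranges, like Wuffs does, can prove this at
    // compile time; in Rust the knowledge only exists as a runtime assert.
    assert!(x < 128 && y < 128);
    x + y
}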
Flow typing and union types
Flow-sensitive typing (or flow typing) is something I came into contact with at work using TypeScript. A type system with flow typing can refine the type of a variable through the flow of a program.
Below is a documented TypeScript snippet to showcase flow typing and union types. You can run it in the TypeScript playground here.
type NumberOrString = number | string;

function whatIsIt(x: NumberOrString) {
  // At this point x could be a number or a string, since we have not
  // inspected x yet.
  if (typeof x === 'string') {
    // In this branch TypeScript understands that x can only be a string.
    console.log('x is a string');
    return;
  }
  // If we reach this point then x was not a string, so x can only
  // be a number. TypeScript is able to strip part of the type since
  // we performed a runtime check to allow this; testing if the type
  // of x is a string or not.
  if (typeof x === 'number') {
    console.log('x is a number');
    return;
  }
  // If we reach this point then x was neither a string nor a number,
  // which should never happen if the program is well typed.
  // Below is an assignment to a variable with type never.
  // never is a special type.
  // It means that no option exists of what this type could be.
  // If we did not exhaustively check every possible variant
  // that x could be then the assignment below would be a type
  // checking error. You can try this yourself by making
  // NumberOrString = number | string | boolean
  const _exhaustivenessCheck: never = x;
  // In a properly typed program this should never print.
  console.log("What have you done?!");
}

const y: NumberOrString = 'hello';
whatIsIt(y);

// This is TypeScript for "Frick you. Just do the type cast."
whatIsIt(false as unknown as NumberOrString);
Some ideas!
Here are some ideas for features in programming languages.
Type refinement + union types on C-like structs
There is a PyPy blog post explaining their then-new dictionary design (it’s a post from 2015).
Their design is not important here; just know that it uses an array of integers (sparse_array) as indices into another array (compact_array).
The value of the indices can only be as large as the size of the dictionary (num_items).
So, if the dictionary has fewer than 256 entries, then using 8-bit indices is sufficient.
When the dictionary grows beyond this size you would need at least 16-bit integers, etc.
struct dict {
    long num_items;
    variable_int *sparse_array;
    dict_entry *compact_array;
};
This means that the actual type (an 8-bit, 16-bit or 32-bit integer) of variable_int depends on the value of num_items.
You cannot express this relationship between variables in C or C++, but you could with union types and refinement.
type small_dict = struct {
    u32(0..(2**8)) num_items;
    u8 *sparse_array;
    dict_entry *compact_array;
};

type medium_dict = struct {
    u32((2**8)..(2**16)) num_items;
    u16 *sparse_array;
    dict_entry *compact_array;
};

type big_dict = struct {
    u32((2**16)..(2**32)) num_items;
    u32 *sparse_array;
    dict_entry *compact_array;
};

type dict = small_dict | medium_dict | big_dict;
In the example the dict comes in three varieties, with either 8-bit, 16-bit or 32-bit indices.
You can differentiate between the three by inspecting the value of num_items.
The syntax u32(lo..hi) is meant as a refinement on u32, where its value must lie in the range lo..hi.
The double star ** means “to the power of”, so 2**8 is 256.
You still have to check at runtime what kind of dictionary you are dealing with, but you would have to do this anyway.
This does not incur extra cost, but the compiler can check that you are not accidentally assuming you are dealing with a small_dict without having done the necessary checks.
Other examples that fall into the same category of interpreting memory differently depending on some bit of data are NaN-boxing and Small String Optimization (SSO).
Also, this construction can be used to implement Rust-like enums yourself! Rust enums are implemented as tagged unions (as far as I know). Tagged unions are also used in TypeScript, and with flow typing you can check that all varieties are handled. No explicit compiler support has to be written for enums and exhaustiveness checking.
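For comparison, here is a hedged sketch of how the same three varieties could be expressed in today’s Rust with an enum, i.e. a tagged union. The field types are made up for illustration and the refinement on num_items is not enforced, but the match is checked for exhaustiveness by the compiler, much like the never trick in the TypeScript example.

struct DictEntry {
    key: u64,
    value: u64,
}

enum Dict {
    Small { num_items: u32, sparse_array: Vec<u8>, compact_array: Vec<DictEntry> },
    Medium { num_items: u32, sparse_array: Vec<u16>, compact_array: Vec<DictEntry> },
    Big { num_items: u32, sparse_array: Vec<u32>, compact_array: Vec<DictEntry> },
}

// How many bytes one index in sparse_array takes for each variant.
fn index_width(d: &Dict) -> usize {
    // Removing one of the arms below is a compile error: the match must be exhaustive.
    match d {
        Dict::Small { .. } => 1,
        Dict::Medium { .. } => 2,
        Dict::Big { .. } => 4,
    }
}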
Refinement based function overloading
This is an idea I had for Rust, or at least for a language that does some form of lifetime analysis.
The example below is for Rust and shows creating a Vec with a capacity of 8.
A few elements are pushed, a reference to an element is taken, and then an attempt is made to push again.
let mut a: Vec<i32> = Vec::with_capacity(8);
a.push(1);
a.push(2);
a.push(3);
let b: &i32 = &a[0];
a.push(4); // <- Highly illegal! You could go to jail for this (not really)
In Rust it is illegal to hold a reference to a Vec element and then push an element: you are holding an immutable reference into the Vec while the push needs to borrow it mutably.
The reason you need to mutably borrow the Vec for a push is that the underlying data where the elements are stored might move.
If a Vec has no space left for a new element, it has to move the elements to a place with more room for new elements.
However, in this example the push will not result in such a relocation, so the reference will still be valid afterwards.
Even if you check that the Vec has spare room, you are not allowed to push while holding a reference.
Annoying!
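For comparison, the usual workaround today is to hold an index instead of a reference and only re-borrow the element at the point of use. A small sketch of that pattern:

fn main() {
    let mut a: Vec<i32> = Vec::with_capacity(8);
    a.push(1);
    a.push(2);
    a.push(3);
    // Keep an index instead of a reference, so no borrow of `a` stays alive.
    let b: usize = 0;
    a.push(4); // Fine: `a` is not borrowed here.
    println!("{}", a[b]); // Re-borrow only when the element is actually needed.
}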
If you could overload a function based on a refinement, then you could define a version of push that does not require a mutable borrow, and the usage would look no different!
That is a big if however 😄
fn push(&self, x: T) where self.len < self.cap {
    // Yay, we can push x and all references are safe!
    // ...
}
// The ordinary push, which may reallocate and therefore needs a mutable borrow.
fn push(&mut self, x: T) {
    // ...
}
Even if overloading were not possible, you could still define another version of push that can only be called if you know that there is enough room.
fn push_safe(&self, x: T) where self.len < self.cap {
    // ...
}
Refinement on memory addresses
Alignment requirements can be expressed through refinement predicates on addresses.
In the pseudocode, address is an implicitly available variable that represents the address of a value of the type.
type Int64 = [8]u8 where (address % 8 == 0); // Must be 8 byte aligned
Maybe on some embedded systems you have ranges of memory that are reserved for certain purposes. For example, the Game Boy has 8 KiB of VRAM mapped to memory range 0x8000-0xA000. It is possible to define pointers that are only allowed to point to VRAM.
type VRAMByte = u8 where (address >= 0x8000 && address < 0xA000);
type VRAMPointer = *VRAMByte;
If the program is well typed, you could never accidentally read or write outside of VRAM when using a VRAMPointer.
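Without such refinements the closest you can get is a runtime check at construction time. Below is a hedged Rust sketch of that approximation; the VramPointer type and the address constants are made up, following the Game Boy example above.

/// A pointer that can only be constructed for addresses inside VRAM.
/// This is a runtime approximation of the compile-time refinement above.
struct VramPointer(*mut u8);

const VRAM_START: usize = 0x8000;
const VRAM_END: usize = 0xA000;

impl VramPointer {
    fn new(addr: usize) -> Option<VramPointer> {
        // The range check happens once at construction, instead of being
        // discharged by the type checker as in the refinement version.
        if addr >= VRAM_START && addr < VRAM_END {
            Some(VramPointer(addr as *mut u8))
        } else {
            None
        }
    }
}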
Checked bending of the rules
Assume you have a language like Wuffs that can guarantee compile time safety for certain issues like integer arithmetic overflow.
This safety does come at a cost.
In order to guarantee that an addition will not overflow you need to check that the integer is below a certain value, for example.
Or to be sure that indexing an array will not be out of bounds you need to check the index value.
In Rust, for example, indexing a Vec out of bounds is checked in both debug and release builds, while integer arithmetic overflow is only checked in debug builds by default.
What you could do is introduce a special function called assume that can be given a predicate, and the compiler will pretend that you did the necessary checks to make the predicate true.
fn index(&self, i: uint) -> T {
    assume(i < self.len);
    // We can now index data.
    return self.data[i];
}
When you add an assumption that turns out to not actually be true, you will have introduced a bug.
But you can find that bug more easily, because all your assumptions have been written down in code!
For debug builds you can turn the calls to assume into asserts and find the problem.
assume is essentially a debug_assert, except that it also ties into the flow typing of the language.
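Something close to this already exists as a pattern in today’s Rust, minus the flow-typing integration. Below is a hedged sketch of an assume! macro (my own naming) that asserts in debug builds and turns into an optimizer hint in release builds.

// A sketch of assume in today's Rust: an assertion in debug builds and an
// optimizer hint in release builds. The flow-typing integration described
// above is the part that is missing.
macro_rules! assume {
    ($cond:expr) => {
        if cfg!(debug_assertions) {
            assert!($cond);
        } else if !($cond) {
            // Safety: the caller promises the condition always holds;
            // if it does not, this is undefined behavior.
            unsafe { std::hint::unreachable_unchecked() }
        }
    };
}

fn index(data: &[u32], i: usize) -> u32 {
    assume!(i < data.len());
    // With the hint in place the optimizer can usually elide the bounds check here.
    data[i]
}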