Friday, May 23, 2014

Rust for C++ programmers - part 7: data types

In this post I'll discuss Rust's data types. These are roughly equivalent to classes, structs, and enums in C++. One difference with Rust is that data and behaviour are much more strictly separated in Rust than C++ (or Java, or other OO languages). Behaviour is defined by functions and those can be defined in traits and `impl`s (implementations), but traits cannot contain data, they are similar to Java's interfaces in that respect. I'll cover traits and impls in a later post, this one is all about data.

Structs


A rust struct is similar to a C struct or a C++ struct without methods. Simply a list of named fields. The syntax is best seen with an example:
struct S {
    field1: int,
    field2: SomeOtherStruct
}

Here we define a struct called `S` with two fields. The fields are comma separated; if you like, you can comma-terminate the last field too.

Structs introduce a type. In the example, we could use `S` as a type. `SomeOtherStruct` is assumed to be another struct (used as a type in the example), and (like C++) it is included by value, that is, there is no pointer to another struct object in memory.

Fields in structs are accessed using the `.` operator and their name. An example of struct use:
fn foo(s1: S, s2: &S) {
    let f = s1.field1;
    if f == s2.field1 {
        println!("field1 matches!");
    }
}

Here `s1` is struct object passed by value and `s2` is a struct object passed by reference. As with method call, we use the same `.` to access fields in both, no need for `->`.

Structs are initialised using struct literals. These are the name of the struct and values for each field. For example,
fn foo(sos: SomeOtherStruct) {
    let x = S { field1: 45, field2: sos };  // initialise x with a struct literal
    println!("x.field1 = {}", x.field1);
}

Structs cannot be recursive, that is you can't have cycles of struct names involving definitions and field types. This is because of the value semantics of structs. So for example, `struct R { r: Option<R> }` is illegal and will cause a compiler error (see below for more about Option). If you need such a structure then you should use some kind of pointer; cycles with pointers are allowed:
struct R {
    r: Option<Box<R>>
}

(Note, `Box` is the new syntax for a unique/owning pointer, this has changed from `~R` which was introduced a few posts back).

If we didn't have the `Option` in the above struct, there would be no way to instantiate the struct and Rust would signal an error.

Structs with no fields do not use braces in either their definition or literal use. Definitions do need a terminating semi-colon though, presumably just to facilitate parsing.
struct Empty;

fn foo() {
    let e = Empty;
}

Tuples


Tuples are anonymous, heterogeneous sequences of data. As a type, they are declared as a sequence of types in parentheses. Since there is no name, they are identified by structure. For example, the type `(int, int)` is a pair of integers and `(i32, f32, S)` is a triple. Enum values are initialised in the same way as enum types are declared, but with values instead of types for the components, e.g., `(4, 5)`. An example:
// foo takes a struct and returns a tuple
fn foo(x: SomeOtherStruct) -> (i32, f32, S) {
    (23, 45.82, S { field1: 54, field2: x })
}

Tuples can be used by destructuring using a `let` expression, e.g.,

fn bar(x: (int, int)) {
    let (a, b) = x;
    println!("x was ({}, {})", a, b);
}

We'll talk more about destructuring next time.

Tuple structs


Tuple structs are named tuples, or alternatively, structs with unnamed fields. They are declared using the `struct` keyword, a list of types in parentheses, and a semicolon. Such a declaration introduces their name as a type. Their fields must be accessed by destructuring (like a tuple), rather than by name. Tuple structs are not very common.
struct IntPoint (int, int);

fn foo(x: IntPoint) {
    let IntPoint(a, b) = x;  // Note that we need the name of the tuple
                             // struct to destructure.
    println!("x was ({}, {})", a, b);
}

Enums


Enums are types like C++ enums or unions, in that they are types which can take multiple values. The simplest kind of enum is just like a C++ enum:
enum E1 {
    Var1,
    Var2,
    Var3
}

fn foo() {
    let x: E1 = Var2;
    match x {
        Var2 => println!("var2"),
        _ => {}
    }
}

However, Rust enums are much more powerful than that. Each variant can contain data. Like tuples, these are defined by a list of types. In this case they are more like unions than enums in C++. Rust enums are tagged unions rather untagged (as in C++), that means you can't mistake one variant of an enum for another at runtime. An example:
enum Expr {
    Add(int, int),
    Or(bool, bool),
    Lit(int)
}

fn foo() {
    let x = Or(true, false);   // x has type Expr
}

Many simple cases of object-oriented polymorphism are better handled in Rust using enums.

To use enums we usually use a match expression. Remember that these are similar to C++ switch statements. I'll go into more depth on these and other ways to destructure data next time. Here's an example:
fn bar(e: Expr) {
    match e {
        Add(x, y) => println!("An `Add` variant: {} + {}", x, y),
        Or(..) => println!("An `Or` variant"),
        _ => println!("Something else (in this case, a `Lit`)"),
    }
}

Each arm of the match expression matches a variant of `Expr`. All variants must be covered. The last case (`_`) covers all remaining variants, although in the example there is only `Lit`. Any data in a variant can be bound to a variable. In the `Add` arm we are binding the two ints in an `Add` to `x` and `y`. If we don't care about the data, we can use `..` to match any data, as we do for `Or`.

Option


One particularly common enum in Rust is `Option`. This has two variants - `Some` and `None`. `None` has no data and `Some` has a single field with type `T` (`Option` is a generic enum, which we will cover later, but hopefully the general idea is clear from C++). Options are used to indicate a value might be there or might not. Any place you use a null pointer in C++ to indicate a value which is in some way undefined, uninitialised, or false, you should probably use an Option in Rust. Using Option is safer because you must always check it before use; there is no way to do the equivalent of dereferencing a null pointer. They are also more general, you can use them with values as well as pointers. An example:
use std::rc::Rc;

struct Node {
    parent: Option<Rc<Node>>,
    value: int
}

fn is_root(node: Node) -> bool {
    match node.parent {
        Some(_) => false,
        None => true
    }
}
Here, the parent field could be either a `None` or a `Some` containing an `Rc<Node>`. In the example, we never actually use that payload, but in real life you usually would.


There are also convenience methods on Option, so you could write the body of `is_root` as `node.is_none()` or `!node.is_some()`.

Inherited mutabilty and Cell/RefCell


Local variables in Rust are immutable by default and can be marked mutable using `mut`. We don't mark fields in structs or enums as mutable, their mutability is inherited. This means that a field in a struct object is mutable or immutable depending on whether the object itself is mutable or immutable. Example:
struct S1 {
    field1: int,
    field2: S2
}
struct S2 {
    field: int
}

fn main() {
    let s = S1 { field1: 45, field2: S2 { field: 23 } };
    // s is deeply immutable, the following mutations are forbidden
    // s.field1 = 46;
    // s.field2.field = 24;

    let mut s = S1 { field1: 45, field2: S2 { field: 23 } };
    // s is mutable, these are OK
    s.field1 = 46;
    s.field2.field = 24;
}

Inherited mutability in Rust stops at references. This is similar to C++ where you can modify a non-const object via a pointer from a const object. If you want a reference field to be mutable, you have to use `&mut` on the field type:
struct S1 {
    f: int
}
struct S2<'a> {
    f: &'a mut S1   // mutable reference field
}
struct S3<'a> {
    f: &'a S1       // immutable reference field
}

fn main() {
    let mut s1 = S1{f:56};
    let s2 = S2 { f: &mut s1};
    s2.f.f = 45;   // legal even though s2 is immutable
    // s2.f = &mut s1; // illegal - s2 is not mutable
    let s1 = S1{f:56};
    let mut s3 = S3 { f: &s1};
    s3.f = &s1;     // legal - s3 is mutable
    // s3.f.f = 45; // illegal - s3.f is immutable
}

(The `'a` parameter on `S2` and `S3` is a lifetime parameter, we'll cover those soon).

Sometimes whilst an object is logically mutable, it has parts which need to be internally mutable. Think of various kinds of caching or a reference count (which would not give true logical immutability since the effect of changing the ref count can be observed via destructors). In C++, you would use the `mutable` keyword to allow such mutation even when the object is const. In Rust we have the Cell and RefCell structs. These allow parts of immutable objects to be mutated. Whilst that is useful, it means you need to be aware that when you see an immutable object in Rust, it is possible that some parts may actually be mutable.

RefCell and Cell let you get around Rust's strict rules on mutation and aliasability. They are safe to use because they ensure that Rust's invariants are respected dynamically, even though the compiler cannot ensure that those invariants hold statically. Cell and RefCell are both single threaded objects.

Use Cell for types which have copy semantics (pretty much just primitive types). Cell has `get` and `set` methods for changing the stored value, and a `new` method to initialise the cell with a value. Cell is a very simple object - it doesn't need to do anything smart since objects with copy semantics can't keep references elsewhere (in Rust) and they can't be shared across threads, so there is not much to go wrong.

Use RefCell for types which have move semantics, that means nearly everything in Rust, struct objects are a common example. RefCell is also created using `new` and has a `set` method. To get the value in a RefCell, you must borrow it using the borrow methods (`borrow`, `borrow_mut`, `try_borrow`, `try_borrow_mut`) these will give you a borrowed reference to the object in the RefCell. These methods follow the same rules as static borrowing - you can only have one mutable borrow, and can't borrow mutably and immutably at the same time. However, rather than a compile error you get a runtime failure. The `try_` variants return an Option - you get `Some(val)` if the value can be borrowed and `None` if it can't. If a value is borrowed, calling `set` will fail too.

Here's an example using a ref-counted pointer to a RefCell (a common use-case):
use std::rc::Rc;
use std::cell::RefCell;

Struct S {
    field: int
}

fn foo(x: Rc<RefCell<S>>) {
    {
        let s = x.borrow();
        println!("the field, twice {} {}", s.f, x.borrow().field);
        // let s = x.borrow_mut(); // Error - we've already borrowed the contents of x
    }

    let s = x.borrow_mut(); // O, the earlier borrows are out of scope
    s.f = 45;
    // println!("The field {}", x.borrow().field); // Error - can't mut and immut borrow
    println!("The field {}", s.f);
}
If you're using Cell/RefCell, you should try to put them on the smallest object you can. That is, prefer to put them on a few fields of a struct, rather than the whole struct. Think of them like single threaded locks, finer grained locking is better since you are more likely to avoid colliding on a lock.

Sunday, May 18, 2014

Rust for C++ programmers - part 6: Rc, Gc, and * pointers

So far we've covered unique and borrowed pointers. Unique pointers are very similar to the new std::unique_ptr in C++ and borrowed references are the 'default' pointer you usually reach for if you would use a pointer or reference in C++. Rust has a few more, rarer pointers either in the libraries or built in to the language. These are mostly similar to various kinds of smart pointers you might be used to in C++.

This post took a while to write and I still don't like it. There are a lot of loose ends here, both in my write up and in Rust itself. I hope some will get better with later posts and some will get better as the language develops. If you are learning Rust, you might even want to skip this stuff for now, hopefully you won't need it. Its really here just for completeness after the posts on other pointer types.

It might feel like Rust has a lot of pointer types, but it is pretty similar to C++ once you think about the various kinds of smart pointers available in libraries. In Rust, however, you are more likely to meet them when you first start learning the language. Because Rust pointers have compiler support, you are also much less likely to make errors when using them.

I'm not going to cover these in as much detail as unique and borrowed references because, frankly, they are not as important. I might come back to them in more detail later on.

Rc<T>

Reference counted pointers come as part of the rust standard library. They are in the `std::rc` module (we'll cover modules soon-ish. The modules are the reason for the `use` incantations in the examples). A reference counted pointer to an object of type `T` has type `Rc<T>`. You create reference counted pointers using a static method (which for now you can think of like C++'s, but we'll see later they are a bit different) - `Rc::new(...)` which takes a value to create the pointer to. This constructor method follows Rust's usual move/copy semantics (like we discussed for unique pointers) - in either case, after calling Rc::new, you will only be able to access the value via the pointer.

As with the other pointer types, the `.` operator does all the dereferencing you need it to. You can use `*` to manually dereference.

To pass a ref-counted pointer you need to use the `clone` method. This kinda sucks, and hopefully we'll fix that, but that is not for sure (sadly). You can take a (borrowed) reference to the pointed at value, so hopefully you don't need to clone too often. Rust's type system ensures that the ref-counted variable will not be deleted before any references expire. Taking a reference has the added advantage that it doesn't need to increment or decrement the ref count, and so will give better performance (although, that difference is probably marginal since Rc objects are limited to a single thread and so the ref count operations don't have to be atomic). As in C++, you can also take a reference to the Gc pointer.

An Rc example:
use std::rc::Rc;

fn bar(x: Rc<int>) { }
fn baz(x: &int) { }

fn foo() {
    let x = Rc::new(45);
    bar(x.clone());   // Increments the ref-count
    baz(&*x);         // Does not increment
    println!("{}", 100 - *x);
}  // Once this scope closes, all Rc pointers are gone, so ref-count == 0
   // and the memory will be deleted.
Ref counted pointers are always immutable. If you want a mutable ref-counted object you need to use a RefCell (or Cell) wrapped in an `Rc`.

Cell and RefCell

Cell and RefCell are structs which allow you to 'cheat' the mutability rules. This is kind of hard to explain without first covering Rust data structures and how they work with mutability, so I'm going to come back to these slightly tricky objects later. For now, you should know that if you want a mutable, ref counted object you need a Cell or RefCell wrapped in an Rc. As a first approximation, you probably want Cell for primitive data and RefCell for objects with move semantics. So, for a mutable, ref-counted int you would use `Rc<Cell<int>>`.

Gc

Garbage collected pointers are written using `Gc<T>` and are created and used much like ref-counted pointers. The difference is that Gc pointers do not automatically turn into references, you need to explicitly call `borrow()`. That is (afaik) not by design and should get fixed. You also don't need to use `clone()` to copy Gc pointers like you do with Rc pointers, again, I hope this inconsistency gets addressed (in theory, this is because copying an Rc pointer does some work incrementing the ref count, whereas a Gc copy is just a copy). Currently Gc is a prototype implementation and is actually implemented using reference counting (so if you have cycles, you are screwed). You should probably just use Rc pointers for now (indeed, the Gc module is marked as experimental which means we expect it to change considerably in the near future and may even be removed).

Gc example:
use std::gc::Gc;

fn bar(x: Gc<int>) { }
fn baz(x: &int) { }

fn foo() {
    let x = Gc::new(45);
    bar(x);
    baz(x.borrow());
    println!("{}", 100 - *x.borrow());
}

 *T - unsafe pointers

Finally Rust has unsafe pointers. These are denoted `*T` and are created using `&` (you might need to specify a type to get a `*T` rather than a `&T` since the `&` operator can create either a borrowed reference or an unsafe pointer). These are like C pointers, just a pointer to memory with no restrictions on how they are used (you can't do pointer arithmetic without casting, but you can do it that way if you must). Unsafe pointers are the only pointer type in Rust which can be null. There is no automatic dereferencing of unsafe pointers (so to call a method you have to write `(*x).foo()`) and no automatic referencing. The most important restriction is that they can't be dereferenced (and thus can't be used) outside of an unsafe block. In regular Rust code you can only pass them around.

So, what is unsafe code? Rust has strong safety guarantees, and (rarely) they prevent you doing something you need to do. Since Rust aims to be a systems language, it has to be able to do anything that is possible and sometimes that means doing things the compiler can't verify is safe. To accomplish that, Rust has the concept of unsafe blocks, marked by the `unsafe` keyword. In unsafe code you can do unsafe things - dereference an unsafe pointer, index into an array without bounds checking, call code written in another language via the FFI, or cast variables. Obviously, you have to be much more careful writing unsafe code than writing regular Rust code. In fact, you should only very rarely write unsafe code. Mostly it is used in very small chunks in libraries, rather than in client code. In unsafe code you must do all the things you normally do in C++ to ensure safety. Furthermore, you must ensure that by the time the unsafe block finishes, you have re-established all of the invariants that the Rust compiler would usually enforce, otherwise you risk causing bugs in safe code too.

An example of using an unsafe pointer:
fn foo() {
    let x = 5;
    let xp: *int = &5;
    println!("x+5={}", add_5(xp));
}

fn add_5(p: *int) -> int {
    unsafe {
        if !p.is_null() { // Note that *-pointers do not auto-deref, so this is
                          // a method implemented on *int, not int.
            *p + 5
        } else {
            -1            // Not a recommended error handling strategy.
        }
    }
}
As with borrowed references, unsafe pointers are immutable by default and can be made mutable using the `mut` keyword, for example `*mut int`.

And that concludes our tour of Rust's pointers. Next time we'll take a break from pointers and look at Rust's data structures. We'll come back to borrowed references again in a later post though.

Tuesday, May 06, 2014

Rust for C++ programmers - part 5: borrowed references

In the last post I introduced unique pointers. This time I will talk about another kind of pointer which is much more common in most Rust programs: borrowed pointers (aka borrowed references, or just references).

If we want to have a reference to an existing value (as opposed to creating a new value on the heap and pointing to it, as with unique pointers), we must use `&`, a borrowed reference. These are probably the most common kind of pointer in Rust, and if you want something to fill in for a C++ pointer or reference (e.g., for passing a parameter to a function by reference), this is probably it.

We use the `&` operator to create a borrowed reference and to indicate reference types, and `*` to dereference them. The same rules about automatic dereferencing apply as for unique pointers. For example,
fn foo() {
    let x = &3;   // type: &int
    let y = *x;   // 3, type: int
    bar(x, *x);
    bar(&y, y);
}

fn bar(z: &int, i: int) {
    // ...
}
The `&` operator does not allocate memory (we can only create a borrowed reference to an existing value) and if a borrowed reference goes out of scope, no memory gets deleted.

Borrowed references are not unique - you can have multiple borrowed references pointing to the same value. E.g.,

fn foo() {
    let x = 5;                // type: int
    let y = &x;               // type: &int
    let z = y;                // type: &int
    let w = y;                // type: &int
    println!("These should all 5: {} {} {}", *w, *y, *z);
}
Like values, borrowed references are immutable by default. You can also use `&mut` to take a mutable reference, or to denote mutable reference types. Mutable borrowed references are unique (you can only take a single mutable reference to a value, and you can only have a mutable reference if there are no immutable references). You can use mutable reference where an immutable one is wanted, but not vice versa. Putting all that together in an example:
fn bar(x: &int) { ... }
fn bar_mut(x: &mut int) { ... }  // &mut int is a reference to an int which
                                 // can be mutated

fn foo() {
    let x = 5;
    //let xr = &mut x;     // Error - can't make a mutable reference to an
                           // immutable variable
    let xr = &x;           // Ok (creates an immutable ref)
    bar(xr);
    //bar_mut(xr);         // Error - expects a mutable ref

    let mut x = 5;
    let xr = &x;           // Ok (creates an immutable ref)
    //*xr = 4;             // Error - mutating immutable ref
    //let xr = &mut x;     // Error - there is already an immutable ref, so we
                           // can't make a mutable one

    let mut x = 5;
    let xr = &mut x;       // Ok (creates a mutable ref)
    *xr = 4;               // Ok
    //let xr = &x;         // Error - there is already a mutable ref, so we
                           // can't make an immutable one
    //let xr = &mut x;     // Error - can only have one immutable ref at a time
    bar(xr);               // Ok
    bar_mut(xr);           // Ok
}
Note that the reference may be mutable (or not) independently of the mutableness of the variable holding the reference. This is similar to C++ where pointers can be const (or not) independently of the data they point to. This is in contrast to unique pointers, where the mutableness of the pointer is linked to the mutableness of the data. For example,
fn foo() {
    let mut x = 5;
    let mut y = 6;
    let xr = &mut x;
    //xr = &mut y;        // Error xr is immutable

    let mut x = 5;
    let mut y = 6;
    let mut xr = &mut x;
    xr = &mut y;          // Ok

    let mut x = 5;
    let mut y = 6;
    let mut xr = &x;
    xr = &y;              // Ok - xr is mut, even though the referenced data is not
}
If a mutable value is borrowed, it becomes immutable for the duration of the borrow. Once the borrowed pointer goes out of scope, the value can be mutated again. This is in contrast to unique pointers, which once moved can never be used again. For example,
fn foo() {
    let mut x = 5;            // type: int
    {
        let y = &x;           // type: &int
        //x = 4;              // Error - x has been borrowed
        println!("{}", x);    // Ok - x can be read
    }
    x = 4;                    // OK - y no longer exists
}
The same thing happens if we take a mutable reference to a value - the value still cannot be modified. In general in Rust, data can only ever be modified via one variable or pointer. Furthermore, since we have a mutable reference, we can't take an immutable reference. That limits how we can use the underlying value:
fn foo() {
    let mut x = 5;            // type: int
    {
        let y = &mut x;       // type: &mut int
        //x = 4;              // Error - x has been borrowed
        //println!("{}", x);  // Error - requires borrowing x
        let z = *y + x;       // Ok - doesn't require borrowing
    }
    x = 4;                    // OK - y no longer exists
}
Unlike C++, Rust won't automatically reference a value for you. So if a function takes a parameter by reference, the caller must reference the actual parameter. However, pointer types will automatically be converted to a reference:
fn foo(x: &int) { ... }

fn bar(x: int, y: ~int) {
    foo(&x);
    // foo(x);   // Error - expected &int, found int
    foo(y);      // Ok
    foo(&*y);    // Also ok, and more explicit, but not good style
}

`mut` vs `const`

At this stage it is probably worth comparing `mut` in Rust to `const` in C++. Superficially they are opposites. Values are immutable by default in Rust and can be made mutable by using `mut`. Values are mutable by default in C++, but can be made constant by using `const`. The subtler and more important difference is that C++ const-ness applies only to the current use of a value, whereas Rust's immutability applies to all uses of a value. So in C++ if I have a `const` variable, someone else could have a non-const reference to it and it could change without me knowing. In Rust if you have an immutable variable, you are guaranteed it won't change.

As we mentioned above, all mutable variables are unique. So if you have a mutable value, you know it is not going to change unless you change it. Furthermore, you can change it freely since you know that no one else is relying on it not changing.

Borrowing and lifetimes

One of the primary safety goals of Rust is to avoid dangling pointers (where a pointer outlives the memory it points to). In Rust, it is impossible to have a dangling borrowed reference. It is only legal to create a borrowed reference to memory which will be alive longer than the reference (well, at least as long as the reference). In other words, the lifetime of the reference must be shorter than the lifetime of the referenced value.

That has been accomplished in all the examples in this post. Scopes introduced by `{}` or functions are bounds on lifetimes - when a variable goes out of scope its lifetime ends. If we try to take a reference to a shorter lifetime, such as in a narrower scope, the compiler will give us an error. For example,
fn foo() {
    let x = 5;
    let mut xr = &x;  // Ok - x and xr have the same lifetime
    {
        let y = 6;
        //xr = &y     // Error - xr will outlive y
    }                 // y is released here
}                     // x and xr are released here
In the above example, x and xr don't have the same lifetime because xr starts later than x, but it's the end of lifetimes which is more interesting, since you can't reference a variable before it exists in any case - something else which Rust enforces and which makes it safer than C++.

Explicit lifetimes

After playing with borrowed pointers for a while, you'll probably come across borrowed pointers with an explicit lifetime. These have the syntax `&'a T` (cf `&T`). They're kind of a big topic since I need to cover lifetime-polymorphism at the same time so I'll leave it for another post (there are a few more less common pointer types to cover first though). For now, I just want to say that `&T` is a shorthand for `&'a T` where `a` is the current scope, that is the scope in which the type is declared.

Saturday, May 03, 2014

A thought on language design

Language designers often aim to make their languages simpler, more elegant, or smaller. If we are doing practical language design (as opposed to design for fun or research), these are not goals, only proxies. Programming languages should be easy to use and easy to learn.

'Easy to use' means (usually) easy to read, write, and operate on using tools. Easy to learn means not just learning the language's features, but learning the idioms/patterns/techniques of the language too.

Simpler, more elegant, and smaller are only good proxies for easy to use and learn because our natural tendency is to make languages too big and too complex ("I'll just add this feature and that feature and, oh, the feature is pretty cool too").

Language design is about balance and trade-offs. No one wants to write software in the lambda calculus and not many people want to do it in Cobol. We need to design languages which balance their complexity and size, not striving to be the smallest or simplest, but to be the easiest and most pleasant to use.