When sets are first introduced to students, usually examples are used with finite, explicitly presented sets. For example, #{1,2,3} uu {2,3,4} = {1,2,3,4}#. This is the beginning of the idea that a set is a “collection” of things. Later, when infinite sets are introduced, the idea that sets are “collections” of things is still commonly used as the intuitive basis for the definitions. While I personally find this a bit of a philosophical bait-and-switch, my main issue with it is that I don’t think it does a good job reflecting how we work with sets day-to-day nor for more in-depth set-theoretic investigations. Instead, I recommend thinking about infinite sets as defined by properties and #x in X# for some infinite set #X# means checking whether it satisfies the property defining #X#, not rummaging through the infinite elements that make up #X# and seeing if #x# is amongst them. This perspective closely reflects how we prove things about infinite sets. It makes it much clearer that the job is to find logical relationships between the properties that define sets.
Of course, this view can also be applied to finite sets and should be applied to them. For a constructivist, the notion of “finite set” splits into multiple inequivalent notions, and it is quite easy to show that there is a subset of #{0,1}# which is not “finite” with respect to strong notions of “finite” that are commonly used by constructivists. Today, though, I’ll stick with classical logic and classical set theory. In particular, I’m going to talk about the Internal Set Theory of Edward Nelson, or mostly the small fragment he used in Radically Elementary Probability Theory. In the first chapter of an unfinished book on Internal Set Theory, he states the following:
Perhaps it is fair to say that “finite” does not mean what we have always thought it to mean. What have we always thought it to mean? I used to think that I knew what I had always thought it to mean, but I no longer think so.
While it may be a bit strong to say that Internal Set Theory leads to some question about what “finite” means, I think it makes a good case for questioning what “finite set” means. These concerns are similar to the relativity of the notion of “(un)countable set”.
Minimal Internal Set Theory (minIST) starts by taking all the axiom( schema)s of ZFC as axioms. So right from the get-go we can do “all of math” in exactly the same way we used to. minIST is a conservative extension of ZFC, so a formula of minIST can be translated to a formula of ZFC that is provable if and only if the original formula was provable in minIST.
Before going into what makes minIST different from ZFC, let’s talk about what “finite set” means in ZFC (and thus minIST). There are multiple equivalent characterizations of a set being “finite”. One common definition is a set that has no bijection with a proper subset of itself. A more positive and more useful equivalent definition is a set is finite if and only if it has a bijection with #{k in bbbN | k < n}# for some #n in bbbN#. This latter definition will be handier for our purposes here.
The only new primitive concept minIST adds is the concept of a “standard” natural number. That is, a new predicate symbol is added to the language of ZFC, #sf "St"#
, and #n in bbbN# is standard if and only if #sf "St"(n)#
holds. For convenience, we’ll define the following shorthand: #forall^{:sf "St":}n.P(n)#
for #forall n.sf "St"(n) => P(n)#
. This is very similar to the shorthand: #forall x in X.P(x)# which stands for #forall x.x in X => P(x)#. In fact, if we had a set of standard natural numbers then we could use this more common shorthand, but it will turn out that the existence of a set of standard natural numbers is contradictory.
The four additional axiom( schema)s of minIST are the following:
#sf "St"(0)#
, i.e. #0# is standard.#forall n.sf "St"(n) => sf "St"(n+1)#
, i.e. if #n# is standard, then #n+1# is standard.#P(0) ^^ (forall^{:sf "St":}n.P(n) => P(n+1)) => forall^{:sf "St":} n.P(n)#
for every formula #P#, i.e. we can do induction over standard natural numbers to prove things about all standard natural numbers.#exists n in bbbN.not sf "St"(n)#
, i.e. there exists a natural number that is not standard.That’s it. There is a bit of a subtlety though. The Axiom Schema of Separation (aka the Axiom Schema of Specification or Comprehension) is an axiom schema of ZFC. It is the axiom schema that justifies the notion of set comprehension, i.e. defining a subset of #X# via #{x in X | P(x)}#. This axiom schema is indexed by the formulas of ZFC. When we incorporated ZFC into minIST, we only got the instances of this axiom schema for ZFC formulas, not for minIST formulas. That is, we are only allowed to form #{x in X | P(x)}# in minIST when #P# is a formula of ZFC. Formulas of minIST that are also formulas of ZFC, i.e. that don’t involve (directly or indirectly) #sf "St"#
are called internal formulas. Other formulas are called external. So we can’t use the Axiom Schema of Separation to form the set of standard natural numbers via #{n in bbbN | sf "St"(n)}#
. This doesn’t mean there isn’t some other way of making the set of standard natural numbers. That possibility is excluded by the following proof.
Suppose there is some set #S# such that #n in S <=> sf "St"(n)#
. Using axioms 1 and 2 we can prove that #0 in S# and if #n in S# then #n+1 in S#. The internal form of induction, like all internal theorems, still holds. Internal induction is just axiom 3 but with #forall# instead of #forall^{:sf "St":}#
. Using internal induction, we can immediately prove that #forall n in bbbN.n in S#, but this means #forall n in bbbN.sf "St"(n)#
which directly contradicts axiom 4, thus there is no such #S#. #square#
This negative result is actually quite productive. For example, we can’t have a set #N# of nonstandard natural numbers otherwise we could define #S# as #bbbN \\ N#. This means that if we can prove some internal property holds for all nonstandard naturals, then there must exist a standard natural number for which it holds as well. Properties like this are called overspill because an internal property that holds for the standard natural numbers spills over into the nonstandard natural numbers. Here’s a small example. First, a real number #x# is infinitesimal if and only if there exists a nonstandard natural #nu# such that #|x| <= 1/nu#
. (Note, we can prove that every nonstandard natural is larger than every standard natural.) Two real numbers #x# and #y# are nearly equal, written |x \simeq y|, if and only if #x - y# is infinitesimal. A function #f : bbbR -> bbbR# is (nearly) continuous at #x# if and only if for all #y in bbbR#, if |x \simeq y| then |f(x) \simeq f(y)|. Now if #f# is continuous at #x#, then for every standard natural #n# and #y in bbbR#, #|x - y| <= delta => |f(x) - f(y)| <= 1/n#
holds for all infinitesimal #delta# by continuity. Thus by overspill it holds for some non-infinitesimal #delta#, i.e. for some #delta > 1/m# for some standard #m#. #square#
In Radically Elementary Probability Theory, Nelson reformulates much of probability theory and stochastic process theory by restricting to the finite case. Traditionally, probability theory for finite sample spaces is (relatively) trivial. But remember, “finite” means bijective with #{k in bbbN | k < n}# for some #n in bbbN# and that #n# can be nonstandard in minIST. The notion of “continuous” defined before translates to the usual notion when the minIST formula is translated to ZFC. Similarly, subsets of infinitesimal probability become subsets of measure zero. Sums of infinitesimals over finite sets of nonstandard cardinality become integrals. Little to nothing is lost, instead it is just presented in a different language, a language that often lets you say complex things simply.
Again, I want to reiterate that everything true in ZFC is true in minIST. One of the issues with Robinson-style nonstandard analysis is that the hyperreals are not an Archimedean field. An ordered field #F# is Archimedean if for all #x in F# such that #x > 0# there is an #n in bbbN# such that #nx > 1#. The hyperreals serve as a model of the (min)IST reals. Within minIST, the Archimedean property holds. Even if #x in bbbR# is infinitesimal, there is still a nonstandard #n in bbbN# such that #nx > 1#. But this statement translates to the statement in the meta-language that there is a hypernatural #n# such that for every positive hyperreal #x#, #nx > 1# which does hold. However, if we use the meta-language’s notion of “natural number”, this statement is false.
Full Internal Set Theory (IST) only has three axiom schemas in addition to the axioms of ZFC. Like minIST, there is #sf "St"#
but now we talk about standard sets in general, not just standard natural numbers. The three axiom schemas are as follows:
#forall^{:sf "St":}t_1 cdots forall^{:sf "St":}t_n.(forall^{:sf "St":} x.P(x, t_1, ..., t_n)) <=> forall x.P(x, t_1, ..., t_n)#
for internal #P# with no (other) free variables.#(forall^{:sf "StFin":}X.exists y.forall x in X.P(x, y)) <=> exists y.forall^{:sf "St"} x.P(x, y)#
for internal #P# which may have other free variables.#forall^{:sf "St":}X.exists^{:sf "St"} Y.forall^{:sf "St":}z.(z in Y <=> z in X ^^ P(z))#
for any formula #P# which may have other free variables.These are called the Transfer Principle (T), the Idealization Principle (I), and the Standardization Principle (S) respectively. #forall^{sf "StFin"}X.P(X)#
means for all #X# which are both standard and finite. The Transfer Principle essentially states that an internal property is true for all sets if and only if it is true for all standard sets. The Transfer Principle allows us to prove that any internal property that is satisfied uniquely is satisfied by a standard set. This means everything we can define in ZFC is standard, e.g. the set of reals, the set of naturals, the set of functions from reals to naturals, any particular natural, #pi#, the Riemann sphere.
The Idealization Principle is what allows nonstandard sets to exist. For example, let #P(x, y) -= y in bbbN ^^ (x in bbbN => x < y)#. The Idealization Principle then states that if for any finite set of natural numbers, we can find a natural number greater than all of them, then there is a natural number greater than any standard natural number. Since we clearly can find a natural number greater than any natural number in some given finite set of natural numbers, the premise holds and thus we have a natural number greater than any standard one and thus necessarily nonstandard. In fact, a similar argument can be used to show that all infinite sets have nonstandard elements. A related argument shows that a standard, finite set is exactly a set all of whose elements are standard.
The Standardization Principle isn’t needed to derive minIST nor is it needed for the following result so I won’t discuss it further. Another result derivable from Idealization is: there is a finite set that contains every standard set. To prove it, simply instantiate the Idealization Principle with #P(x,y) -= x in y ^^ y\ "is finite"#
where “is finite” stands for the formalization of any of the definitions of “finite” given before. It is in the discussion after proving this that Nelson makes the statement quoted in the introduction. We can prove some results about this set. First, it can’t be standard itself. Here are three different reasons why: 1) if it were standard, then it would include itself violating the Axiom of Foundation/Regularity; 2) if it were standard, then by Transfer we could prove that it contains all sets again violating the Axiom of Foundation; 3) if it were standard, then by the result mentioned in the previous paragraph, it would contain only standard elements, but then we could intersect it with #bbbN# to get the set of standard naturals which does not exist. Like the fact that there is no smallest nonstandard natural, there is no smallest finite set containing all standard sets.
IST also illustrates another concept. In set theory we often talk about classes. My experience with this concept was a series of poor explanations: a class was “the extent of a predicate”, “the ‘collection’ of entities that satisfy a predicate” and a proper class was a “collection” that was “too big” to be a set like the class of all sets. I muddled on with vague descriptions like this for quite some time before I finally figured out what they meant. Harking back to my first paragraph, a class (in ZFC^{1}) is just a (unary) predicate, i.e. a formula (of ZFC) with one free variable, or at least it can be represented by such^{2}. A formula #P(x)# is a proper class if there is no set #X# such that #forall x.P(x) <=> x in X#. The “class of all sets” is simply the constantly true predicate. This is a proper class because a set #U# such that #forall x.x in U# leads to a contradiction, e.g. Russell’s paradox by warranting #{x in U | x notin x}# via the Axiom of Separation. It may have already dawned on you that #sf "St"(x)#
is a class in (min)IST and, in fact, a proper class. All of those times when I said “there is no set of standard naturals/nonstandard naturals/infinitesimals/standard sets”, I could equally well have said “there is a proper class of standard naturals/nonstandard naturals/infinitesimals/standard sets”. The result of the previous paragraph means the (proper) class of standard sets – far from being “too big” to be a set – is a subclass of (the class induced by) a finite set! Classes were never about “size”^{3}.
To reiterate, IST is a conservative extension of ZFC. Focusing first on the “conservative” part, when we translate the theorem that there is a finite set containing all standard sets to ZFC, it becomes the tautologous statement that every finite set is a subset of some finite set. This starts to reveal what is happening in IST. However, focusing on the “extension” part, nothing in ZFC refutes any of IST. To take a Platonist perspective, these nonstandard sets could “always have been there” and IST just finally lets us talk about them. If you were a Platonist but didn’t want to accept these nonstandard sets, the conservativity result would still allow you to use IST as “just a shorthand” for formulas in ZFC. Either way, it is clear that the notion of “finite set” in ZFC has a lot of wiggle room.
Some set theories have an explicit notion of “class”.↩
The point is that any logically equivalent formula would do, i.e. classes are equivalence classes of formulas under logical equivalence. One could arguably go further and say that the two formulas represent the same class even if we can’t prove that they are logically equivalent as long as we (somehow) know they are “true” for exactly the same things. This extensional view of predicates is what the “extent” stuff is about.↩
On the other hand, the Axiom Schema of Specification states that any subclass of a set is a set in ZFC, so for ZFC no proper class can be contained in a set. In fact, many non-traditional systems can be understood as only allowing certain subclasses of “sets” to be “sets”. For example, in constructive recursive mathematics, we may require a “set” to be recursively enumerable, but we can certainly have subclasses of recursively enumerable sets that are not recursively enumerable. See also this discussion on the Axiom of Choice and “size”.↩
Over the years I’ve seen a lot of confusion about formal logic online. Usually this is from students (but I’ve also seen it with experienced mathematicians) on platforms like the Math or CS StackExchange. Now there is clearly some selection bias where the people asking the questions are the people who are confused, but while the questions about the confusions are common, the confusions are often evident even in questions about other aspects, the confusions are in the (upvoted!) answers, and when you look at the study material they are using it is often completely unsurprising that they are confused. Again, these confusions are also not limited to just “students”, though I’ll use that word in a broad sense below. Finally, on at least one of the points below, the confusion seems to be entirely on the instructors’ side.
To give an indication of the problem, here is my strong impression of what would happen if I gave a student who had just passed a full year introduction to formal logic the following exercise: “Give me a formal proof of #(neg P => neg Q) => (Q => P)#”. I suspect most would draw up a truth table. When I responded, “that’s not a formal proof,” they would be confused. If you are confused by that response, then this is definitely for you. I’m sure they would completely agree with me that if I asked for the inverse matrix of some matrix #M#, showing that the determinant of #M# is non-zero does not constitute an answer to that question. The parallelism of these scenarios would likely be lost on them though. While I think courses that focus on a syntactic approach to logic will produce students that are much more likely to give an appropriate answer to my exercise, that doesn’t mean they lack the confusions, just that this exercise isn’t a good discriminator for them. For example, if they don’t see the parallelism of the two scenarios I described, or worse, have no idea what truth tables have to do with anything, then they have some gap.
As a disclaimer, I am not an educator nor even an academic in any professional capacity.
As a summary, the points I will touch on are:
If anyone reading this is aware of introductory textbooks or other resources (ideally freely available, but I’ll take a reasonably priced textbook too) that avoid most or all of the major issues I list and otherwise do a good job, I would be very interested. My own education on these topics has been piecemeal and eclectic and also strongly dependent on my background in programming. This leaves me with no reasonable recommendations. Please leave a comment or email me, if you have such a recommendation.
The distinction between syntax and semantics is fundamental in logic. Many, many students, on the other hand, seem completely unaware of the distinction. The most blatant form of this is that for many students it seems quite clear that “proving a formula true” means plugging in truth values and seeing that the result is always “true”. Often they (seemingly) think a truth table is a formal proof. They find it almost impossible to state what the difference between provable and “true” is. The impression I get from some is that they think plugging truth values into formulas is what logic is about.
If the students get to classical predicate logic, most of these misunderstandings will be challenged, but students with this misunderstanding will struggle unnecessarily, and you can see this on sites like math.stackexchange.com where questioners present attempts to prove statements of predicate logic by reducing them to propositional statements. It’s not clear to me what typically happens to students with such misunderstandings as they gain proficiency with classical predicate logic. Certainly some clear up their misunderstanding, but since it’s clear that some still have a slippery grasp of the distinction between syntax and semantics, some misunderstanding remains for them. My guess would be that many just view classical propositional logic and classical predicate logic as more different than they are with unrelated approaches to proof. Any instructors reading this can probably describe much stranger rationalizations that students have come up with, and perhaps give some indications of what faulty rationalizations are common.
Speaking of instructors, the pedagogical problem I see here is not that the textbook authors and instructors don’t understand these distinctions, or even that they don’t present these distinctions, but that they don’t continually emphasize these distinctions. Maybe they do in the classroom, but going by the lecture notes or books, the notation and terminology that many use make it very easy to conflate syntax and semantics. For understandable reasons, proofs of the soundness and completeness theorems that are necessary to justify this conflation are pushed off until much later.
I googled lecture notes "classical propositional logic"
and looked at the first hit I got. It clearly has a section on semantics and multiple sections on a (syntactic) deductive system “helpfully” named “semantic tableaux”. It bases all concepts on truth tables introduced in section 4. Later concepts are justified by corresponding to the truth table semantics. Here is how it describes the semantics for negation:
A negation #neg A# is true (in any situation) if and only if #A# is not true (in that situation).
This is followed by the sentence: “For the moment, we shall suppress the qualifier ‘in any situation’[…]”, and we get the rules for the other connectives starting with the one for conjunction:
A conjunction #A ^^ B# is true if and only if #A# is true and #B# is true.
It later explicates the “in any situation” by talking about truth-value assignments and giving “rules” like:
#A ^^ B# is true in an assignment iff #A# and #B# are true in that assignment.
Between these two renditions of the rule for conjunction, the notes tell us what a “truth-value” is and seemingly what “is true” means with the following text:
In classical propositional logic, formulas may have only one of two possible “truth values”. Each formula must be either true or false; and no formula may be both true and false. These values are written T and F. (Some authors use #1# and #0# instead. Nothing hinges on this.)
The description so far makes it sound like logical formula are “expressions” which “compute” to true or false, which will also be written T or F, like arithmetical expressions. This is made worse in section 11 where the connectives are identified with truth-functions (which is only mentioned in this early section). It would be better to have a notation to distinguish a formula from its truth value e.g. #v(A)# or, the notation I like, |[\! [A]\!]|. (As I’ll discuss in the next section, it would also be better to avoid using philosophically loaded terms like “true” and “false”. #1# and #0# are better choices, but if you really wanted to drive the point home, you could use arbitrary things like “dog” and “cat”. As the authors state, “nothing hinges on this”.) Rewording the above produces a result that I believe is much harder to misunderstand.
#v(A ^^ B)=1# if and only if #v(A)=1# and #v(B)=1#.
This is likely to be perceived as less “intuitive” to students, but I strongly suspect the seeming cognitive ease was masking actually grappling with what is being said in the earlier versions of the statement. If we actually want to introduce truth functions, I would highly recommend using different notation for them, for example:
#v(A ^^ B)=sf "and"(v(A),v(B))#.
Of course, we can interpret “#A# is true” as defining a unary predicate on formulas, “_ is true”, but I seriously doubt this is where most students’ minds jump to when being introduced to formal logic, and this view is undermined by the notion of “truth values”. Again, I’m not saying the authors are saying things that are wrong, I’m saying that they make it very easy to misunderstand what is going on.
Following this, most of the remainder of the notes discusses the method of semantic tableaux. It mentions soundness and completeness but also states: “In this introductory course, we do not expect you to study the proof of these two results[.]”) At times it suggests that the tableaux method can stand on its own: “However, we can show that it is inconsistent without using a truth table, by a form of deductive reasoning. That is, by following computer-programmable inference rules.” But the method is largely presented as summarizing semantic arguments: “This rule just summarizes the information in the truth table that…” In my ideal world, this more deductive approach would be clearly presented as an alternative approach to thinking about logic with the inference rules not explained in terms of semantic notions but related to them after the fact.
My point isn’t to pick on these notes which seem reasonable enough overall. They have some of the other issues I’ll cover but also include some of the things I’d like to see in such texts.
As mentioned in the previous section, words like “true” and “false” have a huge philosophical weight attached to them. It is quite easy to define any particular logic either syntactically or semantically without even suggesting the identification of anything with the philosophical notion of “truth”. My point isn’t that philosophy should be banished from a mathematical logic course, though that is an option, but that we can separate the definition of a logical system with how it relates to the intuitive or philosophical notion of “truth”. This allows us to present logics as the mathematical constructions they are and not feel the need to justify why any given rules/semantics really are the “right” ones. Another way of saying it is that your philosophical understanding/stance doesn’t change the definition of classical propositional logic. It just changes whether that logic is a good reflection of your philosophical views.
(As an aside, while I did read introductions to logic early on, the beginnings of my in-depth understanding of logic arose via programming language theory and type systems [and then category theory]. These areas often cover the same concepts and systems but with little to no philosophizing. Papers discuss classical and constructive logics with no philosophical commitment for or against the law of excluded middle being remotely implied.)
Of course, we do want to leverage our intuitive understanding of logic to motivate either the inference rules or the chosen semantics, but we can phrase the definitions as being inspired by the intuition rather than codifying it. Things like the common misgivings students (and others) have about material implication are a lot easier to deal with when logic isn’t presented as “The One True Formulation of Correct Reasoning”. Of course another way to undermine this impression would be to present a variety of logics including ones where material implication doesn’t hold. This is also confounded by unnecessary philosphizing. It makes it seem like non-classical logics are only of interest to crazy contrarians who reject “self-evident” “truths” like every statement is either “true” or “false”. While there definitely is a philosophical discussion that could be had here, there are plenty of mathematical reasons to be interested in non-classical logics regardless of your philosophical stance on truth. Non-classical logics can also shed a lot of light on what’s going on in classical logics.
This assumes non-classical logics are even mentioned at all.
I’m highly confident you can go through many introductory courses on logic and never have any inkling that there are logics other than classical logics. I suspect you can easily go through an entire math major and also be completely unaware of non-classical logics. For an introductory course on logic, I completely understand that time constraints can make it difficult to include any substantial discussion of non-classical logics. I somewhat less forgiving when it comes to textbooks.
What bothers me isn’t that non-classical logics aren’t given equal time - I wouldn’t expect them to be - it’s that classical logic is presented as a comprehensive picture of logic. This is likely, to some degree, unintentional, but it is easily fixable. First, at the beginning of the course, simply mention that the course will cover classical logics and there are other logics like constructive or paraconsistent logics that have different rules and different notions of semantics. Use less encompassing language. Instead of saying “logic”, say “a logic” or “the logic we’re studying”. Similarly, talk about “a semantics” rather than “the semantics”, and “a set of rules” rather than “the rules”. Consistently say “classical propositional/first-order/predicate logic” rather than just “propositional/first-order/predicate logic” or worse just “logic”. Note points where non-classical logics make different choices when the topics come up. For example, that truth tables are a semantics for classical propositional logic while non-classical logics will use a different notion of semantics. The law of excluded middle, indirect proof, and material implication are other areas where a passing mention of how different logics make different decisions about these topics could be made. The principle of explosion, related to the issues with material implication, is an interesting point where classical and constructive logics typically agree while paraconsistent logics differ. Heck, I suspect some people find my use of the plural, “logics”, jarring, as if there could be more than one logic. Fancy that!
The goal with this advice is just to indicate to students that there is a wider world of logic out there. I don’t expect students to be quizzed on this. Some of this advice is relevant even purely for classical logics. It’s not uncommon for people to ask questions on the Math StackExchange asking for a proof of some formula. When asked what rules they are using, they seem confused: “The rules of predicate logic(?!)” The idea that the presentation of (classical) predicate logic that they were given is only one of many possibilities seems never to have occurred to them. Similarly for semantics. So generally when a choice is made, it would seem of long-term pedagogical value to at least mention that a choice has been made and that other choices are possible.
Nearly any week on the Math StackExchange you can find someone asking about an exercise like the following:
Translate “Jan likes Sally but not Alice” to a logical formula.
Exercises like these are just a complete and utter waste of time. The only people who might do something like this are linguists. Even then the “linguistics” presented in introductions to formal logic is (understandably) ridiculously naive, and an introductory course on formal logic is simply not the place for it anyway. What the vast majority of consumers of formal logic do need is to 1) be able to understand what formal expressions mean, and 2) be able to present their own thoughts formally. Even when formalizing another person’s informal proof, the process is one of understanding what the person was saying and then writing down your understanding in formal language. Instead what’s presented is rules-of-thumb like “‘but’ means ‘and’”.
I could understand having an in-class discussion where the instructor presents natural language statements and asks students interactively to provide various interpretations of the statements. The ambiguity in natural language should quickly become apparent and this motivates the introduction of unambiguous formal notation. What makes no sense to me is giving students sheets of natural language statements about everyday situations that they need to “formalize” and grading them on this. Even if this was valuable, which I don’t think it is, the opportunity cost of spending time on this rather than some other topic would make it negatively valued to me.
What is fine is telling students how to read formal notation in natural language as long as it’s clear that these readings are not definitions. I have seen people claim that e.g. #A => B# is defined as “if #A#, then #B#”, but this is likely a misunderstanding on their part and not what their instructors or books said. On the other hand, it doesn’t seem too rare for a textbook to present a logical connective and then explicate it in terms of natural language examples shortly before or after providing a more formal definition (via rules or semantics). I don’t find it completely surprising that some students confuse these examples as definitions especially when the actual definitions, at times, don’t look terribly different from schematic phrases of natural languages.
I think natural language examples, and especially everyday as opposed to mathematical examples, should largely be avoided, or at the very least should be used with care.
Negation introduction is the following rule: if by assuming #P# we can produce a contradiction, then we can conclude #neg P#. You could present this as an axiom: #(P => _|_) => neg P#
.
Double negation elimination is the following rule: if by assuming #neg P# we can produce a contradiction, then we can conclude #P#. You could present this as an axiom: #(neg P => _|_) => P#
or #neg neg P => P#.
It seems pretty evident that many experienced mathematicians, let alone instructors and students, don’t see any difference between these. “Proof by contradiction” or “reductio ad absurdum” are often used to ambiguously refer to either of the above rules. This has been discussed a few times. When even the likes of Timothy Gowers does this, I think it is safe to say many others do as well. Or just look at the comments to Andrej’s article which I highly recommend.
As discussed in the referenced articles, and evidenced in the comments, there seems to be two things going on here. First, it seems that many people believe that dropping double negations, i.e. treating #neg neg P# and #P# is just a thing that you can do. They incorrectly reason as follows: substitute #neg P# into #(P => _|_) => neg P#
to derive #(neg P => _|_) => neg neg P#
which is the same as #(neg P => _|_) => P#
. Of course, double negation elimination, the thing they are trying to show follows from negation introduction, is what lets them do that last step in the first place! It’s clear that negation introduction only adds (introduces, if you will) negations and can’t eliminate them. The second is that any informal proof of the form “Suppose #P# … hence a contradiction” is called a “proof by contradiction”.
As alluded to in the articles, people develop and state false beliefs about constructive logics because they are unaware of this distinction. This is an excellent area where exploring non-classical logics can be hugely clarifying. The fact that negation introduction is not only derivable but often axiomatic in constructive logics, while adding double negation elimination to a constructive logic makes it classical, very clearly illustrates that there is a huge difference between these rules. But forget about constructive logics, what’s happening here is many mathematicians are deeply confused about the key thing that makes classical logics classical. Further, students are being taught this confusion. Clearly there is some pedagogical failure occurring here.
A major cause of the above conflation of negation introduction and double negation elimination is the overuse and overemphasis of indirect proof. This causes more problems than just the aforementioned conflation. For example, it’s relatively common to see proofs by students that are of the form: “Assume #not P#. We can show that #P# holds via [proof of #P# without using the assumption #not P#]. Hence we have #P# and #not P# and thus a contradiction. Therefore #P#.” This is, of course, silly. The indirect proof by contradiction is just a wrapper around a direct proof of #P#. To make this completely clear, the logical structure mirrors the logical structure of the following argument: “Assume for contradiction #1+1 != 2#. Since by calculation #1+1=2# we have a contradiction. Thus #1+1=2#.”
It’s not always this egregious. Often the use indirect proof isn’t completely superfluous like the previous examples, but there is nevertheless a direct proof which is clearer, more informative, more general, and shorter. If these proofs are clearer and shorter, why don’t students find them? I see a variety of reasons for this. First, indirect proof is conceptually more complicated so more effort is spent explaining it, more examples using it are given, and its use is encouraged for practice. Second, the proof system provided is often intimately based on classical propositional logic (e.g. the “semantic tableaux” above or resolution) and often essentially requires the use of indirect proof to prove anything. For example, if we take resolution as the primitive rule, i.e. if #P vv Q# and #not Q# then #P#, then doing a case analysis requires an indirect proof in general since clearly both #P# and #Q# could hold so we can only hypothetically assume #not Q#. (If you want a more formal argument, see this question and its answers.) Relying on double negation elimination to get negation introduction is another example. It’s common in less formal approaches to effectively take all classical propositional tautologies as axioms (which blurs the distinction between syntax and semantics). Many equivalences or definitions also hide uses of indirect proof (if we were to demonstrate them with respect to a constructive logic). For example, #P => Q -= not P vv Q#, #not not P -= P#, #exists x. P(x) -= not forall x.not P(x)#, and #P vv Q -= not (not P ^^ not Q)#. Third, often definitions and theorems are given in a “negative” way so that their negations or contrapositives are actually more useful. For example, defining “injective” as #f(x)=f(y) => x = y# is a little more convenient than #x != y => f(x) != f(y)# e.g. when proving an injective function has a post-inverse. Last but not least, it certainly can be harder, up to and including impossible, to prove a statement directly (i.e. constructively). That extra information doesn’t come from nowhere. As an example, there is a simple non-constructive proof that either #sqrt(2)^sqrt(2)# or #(sqrt(2)^sqrt(2))^sqrt(2)#
is a rational number that’s of the form of an irrational number raised to an irrational power, but which it is is much harder to show.
What I’d recommend is for instructors and authors to strongly encourage direct/constructive proofs suggesting that a direct proof should be attempted before reaching for indirect proof. Perhaps even require direct/constructive proofs unless the exercise specifies otherwise. This, of course, requires a proof system that doesn’t force the use of indirect proofs. I’d recommend a natural deduction or sequent style presentation where each logical connective is specified in terms of rules that are independent of the other connectives. (I’m not saying that the formal sequent calculus notation needs to be used, just that there should be a rule, formal or informal, for each rule in the usual presentation of natural deduction, say.) As in the earlier section, I strongly recommend keeping syntax and semantics separate: the proof system should be able to stand on its own. Definitions and theorems should be given in forms that are constructively-friendly, which, admittedly, can take a good amount of care. To avoid students spending a lot of time looking for simple direct proofs that don’t exist, give hints where indirect proof is or is not required. While it can be helpful to motivate the significance of direct proofs and why certain choices were made, there’s no need to explicitly reference constructive logics for any of this.
Finally, here are some minor issues that either seem less common or less problematic as well as some potential missed opportunities.
The first is a failure of abstraction I’ve occasionally seen, and I think certain books encourage. I’ve seen students who talk about the “symbols” of their logic as “including …, '('
, and ')'
”. While certainly how to parse strings into syntax trees is a core topic of formal languages, it’s not that important for formal logic. The issues with parsing formulas I see have to do with conventions on omitting parentheses, not an inability to match them. There can be some benefit to driving home that syntax is just inert symbols, but the concept of a syntax tree seems more important and the process of producing a syntax tree from a string of symbols rather secondary. Working at the level of strings adds a lot of complexity and offers little insight into the logical concepts.
As a missed opportunity, there’s a lot of algorithmic content in formal logic. This should come as no surprise as the mechanization of reasoning has been a long time dream of mathematics and was an explicit goal during the birth of formal logic. Unfortunately, if the instructors and/or students don’t have a programming background it can be hard to really leverage this. Still, my impression is that the only indication of the mechanizability of much of logic that students get is in the mechanical drudgery of doing certain exercises. Even when it’s not possible to actually implement some of the relevant ideas, they can still be mentioned. For example, it is often taken as a requirement of a proof system that the proofs can be mechanically verified. That is, we can write a program that takes a formal proof and tells you whether it is a valid proof or not. This should be at least mentioned. Even better, for more than just this reason, it can be experienced by using proof assistants like Coq or MetaMath or simpler things like Logitext. There are other areas of mechanization. Clearly, we can mechanize the process of checking truth tables for classical propositional logic, though there are interesting efficiency aspects to mention here. The completeness theorem of classical propositional logic is also constructively provable producing an algorithm that will take a formula with a truth table showing its validity and will produce a formal proof of that formula. As a little more open-ended topic, much of proof search is mechanical. I’ve often seen students stumped by problems whose proofs can be found in a completely goal-directed manner without any guessing at all.
Another issue is the use of Hilbert-style presentations of logics. It does seem that many logic texts thankfully use natural deduction or some other humane proof system, but Hilbert-style presentations aren’t uncommon. The benefit of Hilbert-style presentations of logics is that they can simplify meta-logical reasoning. But many introductions to logic do little to no meta-logical reasoning. The downsides of Hilbert-style presentations is that they don’t match informal reasoning well, they don’t provide much insight into the logical connectives, and proofs in Hilbert-style proof systems are just painful where even simple results are puzzles. Via the lens of Curry-Howard, Hilbert-style proof systems correspond to combinatory algebras, and writing programs in combinator systems (such as unlambda) is similarly unpleasant. The fact that introducing logic using a Hilbert-style proof system is formally analogous to introducing programming using an esoteric programming language designed to be obfuscatory says something…
]]>This will be a very non-traditional introduction to the ideas behind category theory. It will essentially be a slice through model theory (presented in a more programmer-friendly manner) with an unusual organization. Of course the end result will be ***SPOILER ALERT*** it was category theory all along. A secret decoder ring will be provided at the end. This approach is inspired by the notion of an internal logic/language and by Vaughn Pratt’s paper The Yoneda Lemma Without Category Theory.
I want to be very clear, though. This is not meant to be an analogy or an example or a guide for intuition. This is category theory. It is simply presented in a different manner.
Dan Doel pointed me at some draft lecture notes by Mike Shulman that seem very much in the spirit of this blog post (albeit aimed at an audience familiar with category theory): Categorical Logic from a Categorical Point of View. A theory in my sense corresponds to a category presentation (Definition 1.7.1) as defined by these lecture notes. Its oft-mentioned Appendix A will also look very familiar.
The first concept we’ll need is that of a theory. If you’ve ever implemented an interpreter for even the simplest language, then most of what follows modulo some terminological differences should be both familiar and very basic. If you are familiar with algebraic semantics, then that is exactly what is happening here only restricting to unary (but multi-sorted) algebraic theories.
For us, a theory, #ccT#, is a collection of sorts, a collection of (unary) function symbols^{1}, and a collection of equations. Each function symbol has an input sort and an output sort which we’ll call the source and target of the function symbol. We’ll write #ttf : A -> B# to say that #ttf# is a function symbol with source #A# and target #B#. We define #"src"(ttf) -= A#
and #"tgt"(ttf) -= B#
. Sorts and function symbols are just symbols. Something is a sort if it is in the collection of sorts. Nothing else is required. A function symbol is not a function, it’s just a, possibly structured, name. Later, we’ll map those names to functions, but the same name may be mapped to different functions. In programming terms, a theory defines an interface or signature. We’ll write #bb "sort"(ccT)#
for the collection of sorts of #ccT# and #bb "fun"(ccT)#
for the collection of function symbols.
A (raw) term in a theory is either a variable labelled by a sort, #bbx_A#, or it’s a function symbol applied to a term, #tt "f"(t)#
, such that the sort of the term #t# matches the source of #ttf#. The sort or target of a term is the sort of the variable if it’s a variable or the target of the outermost function symbol. The source of a term is the sort of the innermost variable. In fact, all terms are just sequences of function symbol applications to a variable, so there will always be exactly one variable. All this is to say the expressions need to be “well-typed” in the obvious way. Given a theory with two function symbols #ttf : A -> B# and #ttg : B -> A#, #bbx_A#
, #bbx_B#
, #tt "f"(bbx_A)#
, and #tt "f"(tt "g"(tt "f"(bbx_A)))#
are all examples of terms. #tt "f"(bbx_B)#
and #tt "f"(tt "f"(bbx_A))#
are not terms because they are not “well-typed”, and #ttf# by itself is not a term simply because it doesn’t match the syntax. Using Haskell syntax, we can define a data type representing this syntax if we ignore the sorting:
Using GADTs, we could capture the sorting constraints as well:
data Term (s :: Sort) (t :: Sort) where
Var :: Term t t
Apply :: FunctionSymbol x t -> Term s x -> Term s t
An important operation on terms is substitution. Given a term #t_1# with source #A# and a term #t_2# with target #A# we define the substitution of #t_2# into #t_1#, written #t_1[bbx_A |-> t_2]#, as:
If #t_1 = bbx_A# then #bbx_A[bbx_A |-> t_2] -= t_2#.
If
#t_1 = tt "f"(t)#
then#tt "f"(t)[bbx_A |-> t_2] -= tt "f"(t[bbx_A |-> t_2])#
.
Using the theory from before, we have:
#tt "f"(bbx_A)[bbx_A |-> tt "g"(bbx_B)] = tt "f"(tt "g"(bbx_B))#
As a shorthand, for arbitrary terms #t_1# and #t_2#, #t_1(t_2)# will mean #t_1[bbx_("src"(t_1)) |-> t_2]#
.
Finally, equations^{2}. An equation is a pair of terms with equal source and target, for example, #(: tt "f"(tt "g"(bbx_B)), bbx_B :)#
. The idea is that we want to identify these two terms. To do this we quotient the set of terms by the congruence generated by these pairs, i.e. by the reflexive-, symmetric-, transitive-closure of the relation generated by the equations which further satisfies “if #s_1 ~~ t_1# and #s_2 ~~ t_2# then #s_1(s_2) ~~ t_1(t_2)#
”. From now on, by “terms” I’ll mean this quotient with “raw terms” referring to the unquotiented version. This means that when we say “#tt "f"(tt "g"(bbx_B)) = bbx_B#
”, we really mean the two terms are congruent with respect to the congruence generated by the equations. We’ll write #ccT(A, B)# for the collection of terms, in this sense, with source #A# and target #B#. To make things look a little bit more normal, I’ll write #s ~~ t# as a synonym for #(: s, t :)# when the intent is that the pair represents a given equation.
Expanding the theory from before, we get the theory of isomorphisms, #ccT_{:~=:}#
, consisting of two sorts, #A# and #B#, two function symbols, #ttf# and #ttg#, and two equations #tt "f"(tt "g"(bbx_B)) ~~ bbx_B#
and #tt "g"(tt "f"(bbx_A)) ~~ bbx_A#
. The equations lead to equalities like #tt "f"(tt "g"(tt "f"(bbx_A))) = tt "f"(bbx_A)#
. In fact, it doesn’t take much work to show that this theory only has four distinct terms: #bbx_A#, #bbx_B#, #tt "f"(bbx_A)#
, and #tt "g"(bbx_B)#
.
In traditional model theory or universal algebra we tend to focus on multi-ary operations, i.e. function symbols that can take multiple inputs. By restricting ourselves to only unary function symbols, we expose a duality. For every theory #ccT#, we have the opposite theory, #ccT^(op)# defined by using the same sorts and function symbols but swapping the source and target of the function symbols which also requires rewriting the terms in the equations. The rewriting on terms is the obvious thing, e.g. if #ttf : A -> B#, #ttg : B -> C#, and #tth : C -> D#, then the term in #ccT#, #tt "h"(tt "g"(tt "f"(bbx_A)))#
would become the term #tt "f"(tt "g"(tt "h"(bbx_D)))#
in #ccT^(op)#. From this it should be clear that #(ccT^(op))^(op) = ccT#
.
Given two theories #ccT_1# and #ccT_2# we can form a new theory #ccT_1 xx ccT_2# called the product theory of #ccT_1# and #ccT_2#. The sorts of this theory are pairs of sorts from #ccT_1# and #ccT_2#. The collection of function symbols is the disjoint union #bb "fun"(ccT_1) xx bb "sort"(ccT_2) + bb "sort"(ccT_1) xx bb "fun"(ccT_2)#
. A disjoint union is like Haskell’s Either
type. Here we’ll write #tt "inl"#
and #tt "inr"#
for the left and right injections respectively. #tt "inl"#
takes a function symbol from #ccT_1# and a sort from #ccT_2# and produces a function symbol of #ccT_1 xx ccT_2# and similarly for #tt "inr"#
. If #tt "f" : A -> B#
in #ccT_1# and #C# is a sort of #ccT_2#, then #tt "inl"(f, C) : (A, C) -> (B, C)#
and similarly for #tt "inr"#
.
The collection of equations for #ccT_1 xx ccT_2# consists of the following:
#tt "inl"(tt "f", C)#
#tt "inl"(tt "f", D)(tt "inr"(A, tt "g")(bbx_{:(A, C")":})) ~~ tt "inr"(B, tt "g")("inl"(tt "f", C)(bbx_{:(A, C")":}))#
The above is probably unreadable. If you work through it, you can show that every term of #ccT_1 xx ccT_2# is equivalent to a pair of terms #(t_1, t_2)# where #t_1# is a term in #ccT_1# and #t_2# is a term in #ccT_2#. Using this equivalence, the first bullet is seen to be saying that if #l = r# in #ccT_1# and #C# is a sort in #ccT_2# then #(l, bbx_C) = (r, bbx_C)# in #ccT_1 xx ccT_2#. The second is similar. The third then states
#(t_1, bbx_C)((bbx_A, t_2)(bbx_{:"(A, C)":})) = (t_1, t_2)(bbx_{:"(A, C)":}) = (bbx_A, t_2)((t_1, bbx_C)(bbx_{:"(A, C)":}))#
.
To establish the equivalence between terms of #ccT_1 xx ccT_2# and pairs of terms from #ccT_1# and #ccT_2#, we use the third bullet to move all the #tt "inl"#
s outward at which point we’ll have a sequence of #ccT_1# function symbols followed by a sequence of #ccT_2# function symbols each corresponding to term.
The above might seem a bit round about. An alternative approach would be to define the function symbols of #ccT_1 xx ccT_2# to be all pairs of all the terms from #ccT_1# and #ccT_2#. The problem with this approach is that it leads to an explosion in the number of function symbols and equations required. In particular, it easily produces an infinitude of function symbols and equations even when provided with theories that only have a finite number of sorts, function symbols, and equations.
As a concrete and useful example, consider the theory #ccT_bbbN# consisting of a single sort, #0#, a single function symbol, #tts#, and no equations. This theory has a term for each natural number, #n#, corresponding to #n# applications of #tts#. Now let’s articulate #ccT_bbbN xx ccT_bbbN#. It has one sort, #(0, 0)#, two function symbols, #tt "inl"(tt "s", 0)#
and #tt "inr"(0, tt "s")#
, and it has one equation, #tt "inl"(tt "s", 0)(tt "inr"(0, tt "s")(bbx_{:(0, 0")":})) ~~ tt "inr"(0, tt "s")(tt "inl"(tt "s", 0)(bbx_{:(0, 0")":}))#
. Unsurprisingly, the terms of this theory correspond to pairs of natural numbers. If we had used the alternative definition, we’d have had an infinite number of function symbols and an infinite number of equations.
Nevertheless, for clarity I will typically write a term of a product theory as a pair of terms.
As a relatively easy exercise — easier than the above — you can formulate and define the disjoint sum of two theories #ccT_1 + ccT_2#. The idea is that every term of #ccT_1 + ccT_2# corresponds to either a term of #ccT_1# or a term of #ccT_2#. Don’t forget to define what happens to the equations.
Related to these, we have the theory #ccT_{:bb1:}#, which consists of one sort and no function symbols or equations, and #ccT_{:bb0:}# which consists of no sorts and thus no possibility for function symbols or equations. #ccT_{:bb1:}# has exactly one term while #ccT_{:bb0:}# has no terms.
Sometimes we’d like to talk about function symbols whose source is in one theory and target is in another. As a simple example, that we’ll explore in more depth later, we may want function symbols whose sources are in a product theory. This would let us consider terms with multiple inputs.
The natural way to achieve this is to simply make a new theory that contains sorts from both theories plus the new function symbols. A collage, #ccK#, from a theory #ccT_1# to #ccT_2#, written #ccK : ccT_1 ↛ ccT_2#, is a theory whose collection of sorts is the disjoint union of the sorts of #ccT_1# and #ccT_2#. The function symbols of #ccK# consist for each function symbol #ttf : A -> B# in #ccT_1#, a function symbol #tt "inl"(ttf) : tt "inl"(A) -> tt "inl"(B)#
, and similarly for function symbols from #ccT_2#. Equations from #ccT_1# and #ccT_2# are likewise taken and lifted appropriately, i.e. #ttf# is replaced with #tt "inl"(ttf)#
or #tt "inr"(ttf)#
as appropriate. Additional function symbols of the form #k : tt "inl"(A) -> tt "inr"(Z)#
where #A# is a sort of #ccT_1# and #Z# is a sort of #ccT_2#, and potentially additional equations involving these function symbols, may be given. (If no additional function symobls are given, then this is exactly the disjoint sum of #ccT_1# and #ccT_2#.) These additional function symbols and equations are what differentiate two collages that have the same source and target theories. Note, there are no function symbols #tt "inr"(Z) -> tt "inl"(A)#
, i.e. where #Z# is in #ccT_2# and #A# is in #ccT_1#. That is, there are no function symbols going the “other way”. To avoid clutter, I’ll typically assume that the sorts and function symbols of #ccT_1# and #ccT_2# are disjoint already, and dispense with the #tt "inl"#
s and #tt "inr"#
s.
Summarizing, we have #ccK(tt "inl"(A), "inl"(B)) ~= ccT_1(A, B)#
, #ccK(tt "inr"(Y), tt "inr"(Z)) ~= ccT_2(Y, Z)#
, and #ccK(tt "inr"(Z), tt "inl"(A)) = O/#
for all #A#, #B#, #Y#, and #Z#. #ccK(tt "inl"(A), tt "inr"(Z))#
for any #A# and #Z# is arbitrary generated. To distinguish them, I’ll call the function symbols that go from one theory to another bridges. More generally, an arbitrary term that has it’s source in one theory and target in another will be described as a bridging term.
Here’s a somewhat silly example. Consider #ccK_+ : ccT_bbbN xx ccT_bbbN ↛ ccT_bbbN# that has one bridge #tt "add" : (0, 0) -> 0#
with the equations #tt "add"(tt "inl"(tts, 0)(bbx_("("0, 0")"))) ~~ tts(tt "add"(bbx_("("0, 0")")))#
and #tt "add"(tt "inr"(0, tts)(bbx_("("0, 0")"))) ~~ tts(tt "add"(bbx_("("0, 0")")))#
.
More usefully, if a bit degenerately, every theory induces a collage in the following way. Given a theory #ccT#, we can build the collage #ccK_ccT : ccT ↛ ccT# where the bridges consist of the following. For each sort, #A#, of #ccT#, we have the following bridge: #tt "id"_A : tt "inl"(A) -> tt "inr"(A)#
. Then, for every function symbol, #ttf : A -> B# in #ccT#, we have the following equation: #tt "inl"(tt "f")(tt "id"_A(bbx_(tt "inl"(A)))) ~~ tt "id"_B(tt "inr"(tt "f")(bbx_(tt "inl"(A))))#
. We have #ccK_ccT(tt "inl"(A), tt "inr"(B)) ~= ccT(A, B)#
.
You can think of a bridging term in a collage as a sequence of function symbols partitioned into two parts by a bridge. Naturally, we might consider partitioning into more than two parts by having more than one bridge. It’s easy to generalize the definition of collage to combine an arbitrary number of theories, but I’ll take a different, probably worse, route. Given collages #ccK_1 : ccT_1 ↛ ccT_2# and #ccK_2 : ccT_2 ↛ ccT_3#, we can make the collage #ccK_2 @ ccK_1 : ccT_1 ↛ ccT_3# by defining its bridges to be triples of a bridge of #ccK_1#, #k_1 : A_1 -> A_2#, a term, #t : A_2 -> B_2# of #ccT_2#, and a bridge of #ccK_2#, #k_2 : B_2 -> B_3# which altogether will be a bridge of #ccK_2 @ ccK_1# going from #A_1 -> B_3#. These triples essentially represent a term like #k_2(t(k_1(bbx_(A_1))))#
. With this intuition we can formulate the equations. For each equation #t'(k_1(t_1)) ~~ s'(k'_1(s_1))#
where #k_1# and #k'_1#
are bridges of #ccK_1#, we have for every bridge #k_2# of #ccK_2# and term #t# of the appropriate sorts #(k_2, t(t'(bbx)), k_1)(t_1) ~~ (k_2, t(s'(bbx)), k'_1)(s_1)#
and similarly for equations involving the bridges of #ccK_2#.
This composition is associative… almost. Furthermore, the collages generated by theories, #ccK_ccT#, behave like identities to this composition… almost. It turns out these statements are true, but only up to isomorphism of theories. That is, #(ccK_3 @ ccK_2) @ ccK_1 ~= ccK_3 @ (ccK_2 @ ccK_1)# but is not equal.
To talk about isomorphism of theories we need the notion of…
An interpretation of a theory gives meaning to the syntax of a theory. There are two nearly identical notions of interpretation for us: interpretation (into sets) and interpretation into a theory. I’ll define them in parallel. An interpretation (into a theory), #ccI#, is a mapping, written #⟦-⟧^ccI# though the superscript will often be omitted, which maps sorts to sets (sorts) and function symbols to functions (terms). The mapping satisfies:
#⟦"src"(f)⟧ = "src"(⟦f⟧)#
and#⟦"tgt"(f)⟧ = "tgt"(⟦f⟧)#
where#"src"#
and#"tgt"#
on the right are the domain and codomain operations for an interpretation.
We extend the mapping to a mapping on terms via:
#⟦bbx_A⟧ = x |-> x#, i.e. the identity function, or, for interpretation into a theory, `#⟦bbx_A⟧ = bbx_{:⟦A⟧:}#`{.asciimath}
`#⟦tt "f"(t)⟧ = ⟦tt "f"⟧ @ ⟦t⟧#`{.asciimath} or, for interpretation into a theory, `#⟦tt "f"(t)⟧ = ⟦tt "f"⟧(⟦t⟧)#`{.asciimath}
and we require that for any equation of the theory, #l ~~ r#, #⟦l⟧ = ⟦r⟧#. (Technically, this is implicitly required for the extension of the mapping to terms to be well-defined, but it’s clearer to state it explicitly.) I’ll write #ccI : ccT -> bb "Set"#
when #ccI# is an interpretation of #ccT# into sets, and #ccI’ : ccT_1 -> ccT_2# when #ccI’# is an interpretation of #ccT_1# into #ccT_2#.
An interpretation of the theory of isomorphisms produces a bijection between two specified sets. Spelling out a simple example where #bbbB# is the set of booleans:
#⟦A⟧ -= bbbB#
#⟦B⟧ -= bbbB#
`#⟦tt "f"⟧ -= x |-> not x#`{.asciimath}
`#⟦tt "g"⟧ -= x |-> not x#`{.asciimath}
plus the proof #not not x = x#.
As another simple example, we can interpret the theory of isomorphisms into itself slightly non-trivially.
#⟦A⟧ -= B#
#⟦B⟧ -= A#
`#⟦tt "f"⟧ -= tt "g"(bbx_B)#`{.asciimath}
`#⟦tt "g"⟧ -= tt "f"(bbx_A)#`{.asciimath}
As an (easy) exercise, you should define #pi_1 : ccT_1 xx ccT_2 -> ccT_1# and similarly #pi_2#. If you defined #ccT_1 + ccT_2# before, you should define #iota_1 : ccT_1 -> ccT_1 + ccT_2# and similarly for #iota_2#. As another easy exercise, show that an interpretation of #ccT_{:~=:}#
is a bijection. In Haskell, an interpretation of #ccT_bbbN# would effectively be foldNat
. Something very interesting happens when you consider what an interpretation of the collage generated by a theory, #ccK_ccT#, is. Spell it out. In a different vein, you can show that a collage #ccK : ccT_1 ↛ ccT_2# and an interpretation #ccT_1^(op) xx ccT_2 -> bb "Set"#
are essentially the same thing in the sense that each gives rise to the other.
Two theories are isomorphic if there exists interpretations #ccI_1 : ccT_1 -> ccT_2#
and #ccI_2 : ccT_2 -> ccT_1#
such that #⟦⟦A⟧^(ccI_1)⟧^(ccI_2) = A#
and vice versa, and similarly for function symbols. In other words, each is interpretable in the other, and if you go from one interpretation and then back, you end up where you started. Yet another way to say this is that there is a one-to-one correspondence between sorts and terms of each theory, and this correspondence respects substitution.
As a crucially important example, the set of terms, #ccT(A, B)#, can be extended to an interpretation. In particular, for each sort #A#, #ccT(A, -) : ccT -> bb "Set"#
. It’s action on function symbols is the following:
#⟦tt "f"⟧^(ccT(A, -)) -= t |-> tt "f"(t)#
We have, dually, #ccT(-, A) : ccT^(op) -> bb "Set"#
with the following action:
#⟦tt "f"⟧^(ccT(-, A)) -= t |-> t(tt "f"(bbx_B))#
We can abstract from both parameters making #ccT(-, =) : ccT^(op) xx ccT -> bb "Set"#
which, by an early exercise, can be shown to correspond with the collage #ccK_ccT#.
Via an abuse of notation, I’ll identify #ccT^(op)(A, -)# with #ccT(-, A)#, though technically we only have an isomorphism between the interpretations, and to talk about isomorphisms between interpretations we need the notion of…
The theories we’ve presented are (multi-sorted) universal algebra theories. Universal algebra allows us to specify a general notion of “homomorphism” that generalizes monoid homomorphism or group homomorphism or ring homomorphism or lattice homomorphism.
In universal algebra, the algebraic theory of groups consists of a single sort, a nullary operation, #1#, a binary operation, #*#
, a unary operation, #tt "inv"#
, and some equations which are unimportant for us. Operations correspond to our function symbols except that they’re are not restricted to being unary. A particular group is a particular interpretation of the algebraic theory of groups, i.e. it is a set and three functions into the set. A group homomorphism then is a function between those two groups, i.e. between the two interpretations, that preserves the operations. In a traditional presentation this would look like the following:
Say #alpha : G -> K# is a group homomorphism from the group #G# to the group #K# and #g, h in G# then:
#alpha(1_G) = 1_K#
`#alpha(g *_G h) = alpha(g) *_K alpha(h)#`{.asciimath}
`#alpha(tt "inv"_G(g)) = tt "inv"_K(alpha(g))#`{.asciimath}
Using something more akin to our notation, it would look like:
#alpha(⟦1⟧^G) = ⟦1⟧^K#
`#alpha(⟦*⟧^G(g,h)) = ⟦*⟧^K(alpha(g), alpha(h))#`{.asciimath}
`#alpha(⟦tt "inv"⟧^G(g)) = ⟦tt "inv"⟧^K(alpha(g))#`{.asciimath}
The #tt "inv"#
case is the most relevant for us as it is unary. However, for us, a function symbol #ttf# may have a different source and target and so we made need a different function on each side of the equation. E.g. for #ttf : A -> B#, #alpha : ccI_1 -> ccI_2#, and #a in ⟦A⟧^(ccI_1)# we’d have:
#alpha_B(⟦tt "f"⟧^(ccI_1)(a)) = ⟦tt "f"⟧^(ccI_2)(alpha_A(a))#
So a homomorphism #alpha : ccI_1 -> ccI_2 : ccT -> bb "Set"#
is a family of functions, one for each sort of #ccT#, that satisfies the above equation for every function symbol of #ccT#. We call the individual functions making up #alpha# components of #alpha#, and we have #alpha_A : ⟦A⟧^(ccI_1) -> ⟦A⟧^(ccI_2)#
. The definition for an interpretation into a theory, #ccT_2#, is identical except the components of #alpha# are terms of #ccT_2# and #a# can be replaced with #bbx_(⟦A⟧^(ccI_1))#. Two interpretations are isomorphic if we have homomorphism #alpha : ccI_1 -> ccI_2# such that each component is a bijection. This is the same as requiring a homomorphism #beta : ccI_2 -> ccI_1# such that for each #A#, #alpha_A(beta_A(x)) = x# and #beta_A(alpha_A(x)) = x#. A similar statement can be made for interpretations into theories, just replace #x# with #bbx_(⟦A⟧)#
.
Another way to look at homomorphisms is via collages. A homomorphism #alpha : ccI_1 -> ccI_2 : ccT -> bb "Set"#
gives rise to an interpretation of the collage #ccK_ccT#. The interpretation #ccI_alpha : ccK_ccT -> bb "Set"#
is defined by:
`#⟦tt "inl"(A)⟧^(ccI_alpha) -= ⟦A⟧^(ccI_1)#`{.asciimath}
`#⟦tt "inr"(A)⟧^(ccI_alpha) -= ⟦A⟧^(ccI_2)#`{.asciimath}
`#⟦tt "inl"(ttf)⟧^(ccI_alpha) -= ⟦ttf⟧^(ccI_1)#`{.asciimath}
`#⟦tt "inr"(ttf)⟧^(ccI_alpha) -= ⟦ttf⟧^(ccI_2)#`{.asciimath}
`#⟦tt "id"_A⟧^(ccI_alpha) -= alpha_A#`{.asciimath}
The homomorphism law guarantees that it satisfies the equation on #tt "id"#
. Conversely, given an interpretation of #ccK_ccT#, we have the homomorphism, #⟦tt "id"⟧ : ⟦tt "inl"(-)⟧ -> ⟦tt "inr"(-)⟧ : ccT -> bb "Set"#
. and the equation on #tt "id"#
is exactly the homomorphism law.
Consider a homomorphism #alpha : ccT(A, -) -> ccI#. The #alpha# needs to satisfy for every sort #B# and #C#, every function symbol #ttf : C -> D#, and every term #t : B -> C#:
#alpha_D(tt "f"(t)) = ⟦tt "f"⟧^ccI(alpha_C(t))#
Looking at this equation, the possibility of viewing it as a recursive “definition” leaps out suggesting that the action of #alpha# is completely determined by it’s action on the variables. Something like this, for example:
#alpha_D(tt "f"(tt "g"(tt "h"(bbx_A)))) = ⟦tt "f"⟧(alpha_C(tt "g"(tt "h"(bbx_A)))) = ⟦tt "f"⟧(⟦tt "g"⟧(alpha_B(tt "h"(bbx_A)))) = ⟦tt "f"⟧(⟦tt "g"⟧(⟦tt "h"⟧(alpha_A(bbx_A))))#
We can easily establish that there’s a one-to-one correspondence between the set of homomorphisms #ccT(A, -) -> ccI# and the elements of the set #⟦A⟧^ccI#. Given a homomorphism, #alpha#, we get an element of #⟦A⟧^ccI# via #alpha_A(bbx_A)#. Inversely, given an element #a in ⟦A⟧^ccI#, we can define a homomorphism #a^**#
via:
`#a_D^**(tt "f"(t)) -= ⟦tt "f"⟧^ccI(a_C^**(t))#`{.asciimath}
`#a_A^**(bbx_A) -= a#`{.asciimath}
which clearly satisfies the condition on homomorphisms by definition. It’s easy to verify that #(alpha_A(bbx_A))^** = alpha#
and immediately true that #a^**(bbx_A) = a#
establishing the bijection.
We can state something stronger. Given any homomorphism #alpha : ccT(A, -) -> ccI# and any function symbol #ttg : A -> X#, we can make a new homomorphism #alpha * ttg : ccT(X, -) -> ccI# via the following definition:
#(alpha * ttg)(t) = alpha(t(tt "g"(bbx_A)))#
Verifying that this is a homomorphism is straightforward:
#(alpha * ttg)(tt "f"(t)) = alpha(tt "f"(t(tt "g"(bbx_A)))) = ⟦tt "f"⟧(alpha(t(tt "g"(bbx_A)))) = ⟦tt "f"⟧((alpha * ttg)(t))#
and like any homomorphism of this form, as we’ve just established, it is completely determined by it’s action on variables, namely #(alpha * ttg)_A(bbx_A) = alpha_X(tt "g"(bbx_A)) = ⟦tt "g"⟧(alpha_A(bbx_A))#
. In particular, if #alpha = a^**#
, then we have #a^** * ttg = (⟦tt "g"⟧(a))^**#
. Together these facts establish that we have an interpretation #ccY : ccT -> bb "Set"#
such that #⟦A⟧^ccY -= (ccT(A, -) -> ccI)#
, the set of homomorphisms, and #⟦tt "g"⟧^ccY(alpha) -= alpha * tt "g"#
. The work we did before established that we have homomorphisms #(-)(bbx) : ccY -> ccI# and #(-)^** : ccI -> ccY#
that are inverses. This is true for all theories and all interpretations as at no point did we use any particular facts about them. This statement is the (dual form of the) Yoneda lemma. To get the usual form simply replace #ccT# with #ccT^(op)#. A particularly important and useful case (so useful it’s usually used tacitly) occurs when we choose #ccI = ccT(B,-)#, we get #(ccT(A, -) -> ccT(B, -)) ~= ccT(B, A)# or, choosing #ccT^(op)# everywhere, #(ccT(-, A) -> ccT(-, B)) ~= ccT(A, B)# which states that a term from #A# to #B# is equivalent to a homomorphism from #ccT(-, A)# to #ccT(-, B)#.
There is another result, dual in a different way, called the co-Yoneda lemma. It turns out it is a corollary of the fact that for a collage #ccK : ccT_1 ↛ ccT_2#, #ccK_(ccT_2) @ ccK ~= ccK#
and the dual is just the composition the other way. To get (closer to) the precise result, we need to be able to turn an interpretation into a collage. Given an interpretation, #ccI : ccT -> bb "Set"#
, we can define a collage #ccK_ccI : ccT_bb1 ↛ ccT# whose bridges from #1 -> A# are the elements of #⟦A⟧^ccI#. Given this, the co-Yoneda lemma is the special case, #ccK_ccT @ ccK_ccI ~= ccK_ccI#.
Note, that the Yoneda and co-Yoneda lemmas only apply to interpretations into sets as #ccY# involves the set of homomorphisms.
The Yoneda lemma suggests that the interpretations #ccT(A, -)# and #ccT(-, A)# are particularly important and this will be borne out as we continue.
We call an interpretation, #ccI : ccT^(op) -> bb "Set"#
representable if #ccI ~= ccT(-, X)# for some sort #X#. We then say that #X# represents #ccI#. What this states is that every term of sort #X# corresponds to an element in one of the sets that make up #ccI#, and these transform appropriately. There’s clearly a particularly important element, namely the image of #bbx_X# which corresponds to an element in #⟦X⟧^ccI#. This element is called the universal element. The dual concept is, for #ccI : ccT -> bb "Set"#
, #ccI# is co-representable if #ccI ~= ccT(X, -)#. We will also say #X# represents #ccI# in this case as it actually does when we view #ccI# as an interpretation of #(ccT^(op))^(op)#
.
As a rather liberating exercise, you should establish the following result called parameterized representability. Assume we have theories #ccT_1# and #ccT_2#, and a family of sorts of #ccT_2#, #X#, and a family of interpretations of #ccT_2^(op)#, #ccI#, both indexed by sorts of #ccT_1#, such that for each #A in bb "sort"(ccT_1)#
, #X_A# represents #ccI_A#, i.e. #ccI_A ~= ccT_2(-, X_A)#. Given all this, then there is a unique interpretation #ccX : ccT_1 -> ccT_2# and #ccI : ccT_1 xx ccT_2^(op) -> bb "Set"#
where #⟦A⟧^(ccX) -= X_A# and #"⟦("A, B")⟧"^ccI -= ⟦B⟧^(ccI_A)#
such that #ccI ~= ccT_2(=,⟦-⟧^ccX)#. To be a bit more clear, the right hand side means #(A, B) |-> ccT_2(B, ⟦A⟧^ccX)#. Simply by choosing #ccT_1# to be a product of multiple theories, we can generalize this result to an arbitrary number of parameters. What makes this result liberating is that we just don’t need to worry about the parameters, they will automatically transform homomorphically. As a technical warning though, since two interpretations may have the same action on sorts but a different action on function symbols, if the family #X_A# was derived from an interpretation #ccJ#, i.e. #X_A -= ⟦A⟧^ccJ#, it may not be the case that #ccX = ccJ#.
Let’s look at some examples.
As a not-so-special case of representability, we can consider #ccI -= ccK(tt "inl"(-), tt "inr"(Z))#
where #ccK : ccT_1 ↛ ccT_2#. Saying that #A# represents #ccI# in this case is saying that bridging terms of sort #tt "inr"(Z)#
, i.e. sort #Z# in #ccT_2#, in #ccK#, correspond to terms of sort #A# in #ccT_1#. We’ll call the universal element of this representation the universal bridge (though technically it may be a bridging term, not a bridge). Let’s write #varepsilon# for this universal bridge. What representability states in this case is given any bridging term #k# of sort #Z#, there exists a unique term #|~ k ~|# of sort #A# such that #k = varepsilon(|~ k ~|)#. If we have an interpretation #ccX : ccT_2 -> ccT_1# such that #⟦Z⟧^ccX# represents #ccK(tt "inl"(-), tt "inr"(Z))#
for each sort #Z# of #ccT_2# we say we have a right representation of #ccK#. Note, that the universal bridges become a family #varepsilon_Z : ⟦Z⟧^ccX -> Z#
. Similarly, if #ccK(tt "inl"(A), tt "inr"(-))#
is co-representable for each #A#, we say we have a left representation of #ccK#. The co-universal bridge is then a bridging term #eta_A : A -> ⟦A⟧# such that for any bridging term #k# with source #A#, there exists a unique term #|__ k __|#
in #ccT_2# such that #k = |__ k __|(eta_A)#
. For reference, we’ll call these equations universal properties of the left/right representation. Parameterized representability implies that a left/right representation is essentially unique.
Define #ccI_bb1# via #⟦A⟧^(ccI_bb1) -= bb1# where #bb1# is some one element set. #⟦ttf⟧^(ccI_bb1)# is the identity function for all function symbols #ttf#. We’ll say a theory #ccT# has a unit sort or has a terminal sort if there is a sort that we’ll also call #bb1# that represents #ccI_bb1#. Spelling out what that means, we first note that there is nothing notable about the universal element as it’s the only element. However, writing the homomorphism #! : ccI_bb1 -> ccT(-, bb1)# and noting that since there’s only one element of #⟦A⟧^(ccI_bb1)# we can, with a slight abuse of notation, also write the term #!# picks out as #!# which gives the equation:
#!_B(tt "g"(t)) = !_A(t)#
for any function symbol #ttg : A -> B# and term, #t#, of sort #A#, note#!_A : A -> bb1#
.
This equation states what the isomorphism also fairly directly states: there is exactly one term of sort #bb1# from any sort #A#, namely #!_A(bbx_A)#
. The dual notion is called a void sort or an initial sort and will usually be notated #bb0#, the analog of #!# will be written as #0#. The resulting equation is:
#tt "f"(0_A) = 0_B#
for any function symbol #ttf : A -> B#, note #0_A : bb0 -> A#.
For the next example, I’ll leverage collages. Consider the collage #ccK_2 : ccT ↛ ccT xx ccT# whose bridges from #A -> (B, C)# consist of pairs of terms #t_1 : A -> B# and #t_2 : A -> C#. #ccT# has pairs if #ccK_2# has a right representation. We’ll write #(B, C) |-> B xx C# for the representing interpretation’s action on sorts. We’ll write the universal bridge as #(tt "fst"(bbx_(B xx C)), tt "snd"(bbx_(B xx C)))#
. The universal property then looks like #(tt "fst"(bbx_(B xx C)), tt "snd"(bbx_(B xx C)))((: t_1, t_2 :)) = (t_1, t_2)#
where #(: t_1, t_2 :) : A -> B xx C# is the unique term induced by the bridge #(t_1, t_2)#. The universal property implies the following equations:
`#(: tt "fst"(bbx_(B xx C)), tt "snd"(bbx_(B xx C))) = bbx_(B xx C)#`{.asciimath}
`#tt "fst"((: t_1, t_2 :)) = t_1#`{.asciimath}
`#tt "snd"((: t_1, t_2 :)) = t_2#`{.asciimath}
One aspect of note is regardless of whether #ccK_2# has a right representation, i.e. regardless of whether #ccT# has pairs, it always has a left representation. The co-universal bridge is #(bbx_A, bbx_A)# and the unique term #|__(t_1, t_2)__|#
is #tt "inl"(t_1, bbx_A)(tt "inr"(bbx_A, t_2)(bbx_("("A,A")")))#
.
Define an interpretation #Delta : ccT -> ccT xx ccT# so that #⟦A⟧^Delta -= (A,A)# and similarly for function symbols. #Delta# left represents #ccK_2#. If the interpretation #(B,C) |-> B xx C# right represents #ccK_2#, then we say we have an adjunction between #Delta# and #(- xx =)#, written #Delta ⊣ (- xx =)#, and that #Delta# is left adjoint to #(- xx =)#, and conversely #(- xx =)# is right adjoint #Delta#.
More generally, whenever we have the situation #ccT_1(⟦-⟧^(ccI_1), =) ~= ccT_2(-, ⟦=⟧^(ccI_2))#
we say that #ccI_1 : ccT_2 -> ccT_1# is left adjoint to #ccI_2 : ccT_1 -> ccT_2# or conversely that #ccI_2# is right adjoint to #ccI_1#. We call this arrangement an adjunction and write #ccI_1 ⊣ ccI_2#. Note that we will always have this situation if #ccI_1# left represents and #ccI_2# right represents the same collage. As we noted above, parameterized representability actually determines one adjoint given (its action on sorts and) the other adjoint. With this we can show that adjoints are unique up to isomorphism, that is, given two left adjoints to an interpretation, they must be isomorphic. Similarly for right adjoints. This means that stating something is a left or right adjoint to some other known interpretation essentially completely characterizes it. One issue with adjunctions is that they tend to be wholesale. Let’s say the pair sort #A xx B# existed but no other pair sorts existed, then the (no longer parameterized) representability approach would work just fine, but the adjunction would no longer exist.
Here’s a few of exercises using this. First, a moderately challenging one (until you catch the pattern): spell out the details to the left adjoint to #Delta#. We say a theory has sums and write those sums as #A + B# if #(- + =) ⊣ Delta#. Recast void and unit sorts using adjunctions and/or left/right representations. As a terminological note, we say a theory has finite products if it has unit sorts and pairs. Similarly, a theory has finite sums or has finite coproducts if it has void sorts and sums. An even more challenging exercise is the following: a theory has exponentials if it has pairs and for every sort #A#, #(A xx -) ⊣ (A => -)# (note, parameterized representability applies to #A#). Spell out the equations characterizing #A => B#.
Finite products start to lift us off the ground. So far the theories we’ve been working with have been extremely basic: a language with only unary functions, all terms being just a sequence of applications of function symbols. It shouldn’t be underestimated though. It’s more than enough to do monoid and group theory. A good amount of graph theory can be done with just this. And obviously we were able to establish several general results assuming only this structure. Nevertheless, while we can talk about specific groups, say, we can’t talk about the theory of groups. Finite products change this.
A theory with finite products allows us to talk about multi-ary function symbols and terms by considering unary function symbols from products. This allows us to do all of universal algebra. For example, the theory of groups, #ccT_(bb "Grp")#
, consists of a sort #S# and all it’s products which we’ll abbreviate as #S^n# with #S^0 -= bb1# and #S^(n+1) -= S xx S^n#. It has three function symbols #tte : bb1 -> S#, #ttm : S^2 -> S#, and #tti : S -> S# plus the ones that having finite products requires. In fact, instead of just heaping an infinite number of sorts and function symbols into our theory — and we haven’t even gotten to equations — let’s define a compact set of data from which we can generate all this data.
A signature, #Sigma#, consists of a collection of sorts, #sigma#, a collection of multi-ary function symbols, and a collection of equations. Equations still remain pairs of terms, but we need to now extend our definition of terms for this context. A term (in a signature) is either a variable, #bbx_i^[A_0,A_1,...,A_n]#
where #A_i# are sorts and #0 <= i <= n#, the operators #tt "fst"#
or #tt "snd"#
applied to a term, the unit term written #(::)^A# with sort #A#, a pair of terms written #(: t_1, t_2 :)#, or the (arity correct) application of a multi-ary function symbol to a series of terms, e.g. #tt "f"(t_1, t_2, t_3)#
. As a Haskell data declaration, it might look like:
data SigTerm
= SigVar [Sort] Int
| Fst SigTerm
| Snd SigTerm
| Unit Sort
| Pair SigTerm SigTerm
| SigApply FunctionSymbol [SigTerm]
At this point, sorting (i.e. typing) the terms is no longer trivial, though it is still pretty straightforward. Sorts are either #bb1#, or #A xx B# for sorts #A# and #B#, or a sort #A in sigma#. The source of function symbols or terms are lists of sorts.
`#bbx_i^[A_0, A_1, ..., A_n] : [A_0, A_1, ..., A_n] -> A_i#`{.asciimath}
`#(::)^A : [A] -> bb1#`{.asciimath}
`#(: t_1, t_2 :) : bar S -> T_1 xx T_2#`{.asciimath} where #t_i : bar S -> T_i#
`#tt "fst"(t) : bar S -> T_1#`{.asciimath} where #t : bar S -> T_1 xx T_2#
`#tt "snd"(t) : bar S -> T_2#`{.asciimath} where #t : bar S -> T_1 xx T_2#
`#tt "f"(t_1, ..., t_n) : bar S -> T#`{.asciimath} where #t_i : bar S -> T_i# and `#ttf : [T_1,...,T_n] -> T#`{.asciimath}
The point of a signature was to represent a theory so we can compile a term of a signature into a term of a theory with finite products. The theory generated from a signature #Sigma# has the same sorts as #Sigma#. The equations will be equations of #Sigma#, with the terms compiled as will be described momentarily, plus for every pair of sorts the equations that describe pairs and the equations for #!#. Finally, we need to describe how to take a term of the signature and make a function symbol of the theory, but before we do that we need to explain how to convert those sources of the terms which are lists. That’s just a conversion to right nested pairs, #[A_0,...,A_n] |-> A_0 xx (... xx (A_n xx bb1) ... )#
. The compilation of a term #t#, which we’ll write as #ccC[t]#
, is defined as follows:
`#ccC[bbx_i^[A_0, A_1, ..., A_n]] = tt "snd"^i(tt "fst"(bbx_(A_i xx(...))))#`{.asciimath} where `#tt "snd"^i#`{.asciimath} means the #i#-fold application of `#tt "snd"#`{.asciimath}
`#ccC[(::)^A] = !_A#`{.asciimath}
`#ccC[(: t_1, t_2 :)] = (: ccC[t_1], ccC[t_2] :)#`{.asciimath}
`#ccC[tt "fst"(t)] = tt "fst"(ccC[t])#`{.asciimath}
`#ccC[tt "snd"(t)] = tt "snd"(ccC[t])#`{.asciimath}
`#ccC[tt "f"(t_1, ..., t_n)] = tt "f"((: ccC[t_1], (: ... , (: ccC[t_n], ! :) ... :) :))#`{.asciimath}
As you may have noticed, the generated theory will have an infinite number of sorts, an infinite number of function symbols, and an infinite number of equations no matter what the signature is — even an empty one! Having an infinite number of things isn’t a problem as long as we can algorithmically describe them and this is what the signature provides. Of course, if you’re a (typical) mathematician you nominally don’t care about an algorithmic description. Besides being compact, signatures present a nicer term language. The theories are like a core or assembly language. We could define a slightly nicer variation where we keep a context and manage named variables leading to terms-in-context like:
#x:A, y:B |-- tt "f"(x, x, y)#
which is
#tt "f"(bbx_0^[A,B], bbx_0^[A,B], bbx_1^[A,B])#
for our current term language for signatures. Of course, compilation will be (slightly) trickier for the nicer language.
The benefit of having compiled the signature to a theory, in addition to being able to reuse the results we’ve established for theories, is we only need to define operations on the theory, which is simpler since we only need to deal with pairs and unary function symbols. One example of this is we’d like to extend our notion of interpretation to one that respects the structure of the signature, and we can do that by defining an interpretation of theories that respects finite products.
A finite product preserving interpretation (into a finite product theory), #ccI#, is an interpretation (into a finite product theory) that additionally satisfies:
`#⟦bb1⟧^ccI ~~ bb1#`{.asciimath}
`#⟦A xx B⟧^ccI ~~ ⟦A⟧^ccI xx ⟦B⟧^ccI#`{.asciimath}
`#⟦!_A⟧^ccI = !_(⟦A⟧^ccI)#`{.asciimath}
`#⟦tt "fst"(t)⟧^ccI = tt "fst"(⟦t⟧^ccI)#`{.asciimath}
`#⟦tt "snd"(t)⟧^ccI = tt "snd"(⟦t⟧^ccI)#`{.asciimath}
`#⟦(: t_1, t_2 :)⟧^ccI = (: ⟦t_1⟧^ccI, ⟦t_2⟧^ccI :)#`{.asciimath}
where, for #bb "Set"#
, #bb1 -= {{}}#, #xx# is the cartesian product, #tt "fst"#
and #tt "snd"#
are the projections, #!_A -= x |-> \{\}#
, and #(: f, g :) -= x |-> (: f(x), g(x) :)#.
With signatures, we can return to our theory, now signature, of groups. #Sigma_bb "Grp"#
has a single sort #S#, three function symbols #tte : [bb1] -> S#
, #tti : [S] -> S#
, and #ttm : [S, S] -> S#
, with the following equations:
`#tt "m"(tt "e"((::)^S), bbx_0^S) ~~ bbx_0^S#`{.asciimath}
`#tt "m"(tt "i"(bbx_0^S), bbx_0^S) ~~ tt "e"((::)^S)#`{.asciimath}
`#tt "m"(tt "m"(bbx_0^[S,S,S], bbx_1^[S,S,S]), bbx_2^[S,S,S]) ~~ tt "m"(bbx_0^[S,S,S], tt "m"(bbx_1^[S,S,S], bbx_2^[S,S,S]))#`{.asciimath}
or using the nicer syntax:
`#x:S |-- tt "m"(tt "e"(), x) ~~ x#`{.asciimath}
`#x:S |-- tt "m"(tt "i"(x), x) ~~ tt "e"()#`{.asciimath}
`#x:S, y:S, z:S |-- tt "m"(tt "m"(x, y), z) ~~ tt "m"(x, tt "m"(y, z))#`{.asciimath}
An actual group is then just a finite product preserving interpretation of (the theory generated by) this signature. All of universal algebra and much of abstract algebra can be formulated this way.
We can consider additionally assuming that our theory has exponentials. I left articulating exactly what that means as an exercise, but the upshot is we have the following two operations:
For any term #t : A xx B -> C#, we have the term #tt "curry"(t) : A -> C^B#
. We also have the homomorphism #tt "app"_(AB) : B^A xx A -> B#
. They satisfy:
#tt "curry"(tt "app"(bbx_(B^A xx A))) = bbx_(B^A)#
#tt "app"((: tt "curry"(t_1), t_2 :)) = t_1((: bbx_A, t_2 :))#
where #t_1 : A xx B -> C# and #t_2 : A -> B#.
We can view these, together with the the product operations, as combinators, and it turns out we can compile the simply typed lambda calculus into the above theory. This is exactly what the Categorical Abstract Machine did. The “Caml” in “O’Caml” stands for “Categorical Abstract Machine Language”, though O’Caml no longer uses the CAM. Conversely, every term of the theory can be expressed as a simply typed lambda term. This means we can view the simply typed lambda calculus as just a different presentation of the theory.
At this point, this presentation of category theory starts to connect to the mainstream categorical literature on universal algebra, internal languages, sketches, and internal logic. This page gives a synopsis of the relationship between type theory and category theory. For some reason, it is unusual to talk about the internal language of a plain category, but that is exactly what we’ve done here.
I haven’t talked about finite limits or colimits beyond products and coproducts, nor have I talked about even the infinitary versions of products and coproducts, let alone arbitrary limits and colimits. These can be handled the same way as products and coproducts. Formulating a language like signatures or the simply typed lambda calculus is a bit more complicated, but not that hard. I may make a follow-up article covering this among other things. I also have a side project (don’t hold your breath), that implements the internal language of a category with finite limits. The result looks roughly like a simple version of an algebraic specification language like the OBJ family. The RING
theory described in the Maude manual gives an idea of what it would look like. In fact, here’s an example of the current actual syntax I’m using.^{3}
theory Categories
type O
type A
given src : A -> O
given tgt : A -> O
given id : O -> A
satisfying o:O | src (id o) = o, tgt (id o) = o
given c : { f:A, g:A | src f = tgt g } -> A
satisfying (f, g):{ f:A, g:A | src f = tgt g }
| tgt (c (f, g)) = tgt f, src (c (f, g)) = src g
satisfying "left unit" (o, f):{ o:O, f:A | tgt f = o }
| c (id o, f) = f
satisfying "right unit" (o, f):{ o:O, f:A | src f = o }
| c (f, id o) = f
satisfying "associativity" (f, g, h):{ f:A, g:A, h:A | src f = tgt g, src g = tgt h }
| c (c (f, g), h) = c (f, c (g, h))
endtheory
It turns out this is a particularly interesting spot in the design space. The fact that the theory of theories with finite limits is itself a theory with finite limits has interesting consequences. It is still relatively weak though. For example, it’s not possible to describe the theory of fields in this language.
There are other directions one could go. For example, the internal logic of monoidal categories is (a fragment of) ordered linear logic. You can cross this bridge either way. You can look at different languages and consider what categorical structure is needed to support the features of the language, or you can add features to the category and see how that impacts the internal language. The relationship is similar to the source language and a core/intermediate language in a compiler, e.g. GHC Haskell and System Fc.
If you’ve looked at category theory at all, you can probably make most of the connections without me telling you. The table below outlines the mapping, but there are some subtleties. First, as a somewhat technical detail, my definition of a theory corresponds to a small category, i.e. a category which has a set of objects and a set of arrows. For more programmer types, you should think of “set” as Set
in Agda, i.e. similar to the *
kind in Haskell. Usually “category” means “locally small category” which may have a proper class of objects and between any two objects a set of arrows (though the union of all those sets may be a proper class). Again, for programmers, the distinction between “class” and “set” is basically the difference between Set
and Set1
in Agda.^{4} To make my definition of theory closer to this, all that is necessary is instead of having a set of function symbols, have a family of sets indexed by pairs of objects. Here’s what a partial definition in Agda of the two scenarios would look like:
-- Small category (the definition I used)
record SmallCategory : Set1 where
field
objects : Set
arrows : Set
src : arrows -> objects
tgt : arrows -> objects
...
-- Locally small category
record LocallySmallCategory : Set2 where
field
objects : Set1
hom : objects -> objects -> Set
...
-- Different presentation of a small category
record SmallCategory' : Set1 where
field
objects : Set
hom : objects -> objects -> Set
...
The benefit of the notion of locally small category is that Set
itself is a locally small category. The distinction I was making between interpretations into theories and interpretations into Set was due to the fact that Set wasn’t a theory. If I used a definition theory corresponding to a locally small category, I could have combined the notions of interpretation by making Set a theory. The notion of a small category, though, is still useful. Also, an interpretation into Set corresponds to the usual notion of a model or semantics, while interpretations into other theories was a less emphasized concept in traditional model theory and universal algebra.
A less technical and more significant difference is that my definition of a theory doesn’t correspond to a category, but rather to a presentation of a category, from which a category can be generated. The analog of arrows in a category is terms, not function symbols. This is a bit more natural route from the model theory/universal algebra/programming side. Similarly, having an explicit collection of equations, rather than just an equivalence relation on terms is part of the presentation of the category but not part of the category itself.
model theory | category theory |
---|---|
sort | object |
term | arrow |
function symbol | generating arrow |
theory | presentation of a (small) category |
collage | collage, cograph of a profunctor |
bridge | heteromorphism |
signature | presentation of a (small) category with finite products |
interpretation into sets, aka models | a functor into Set, a (co)presheaf |
interpretation into a theory | functor |
homomorphism | natural transformation |
simply typed lambda calculus (with products) | a cartesian closed category |
In some ways I’ve stopped just when things were about to get good. I may do a follow-up to elaborate on this good stuff. Some examples are: if I expand the definition so that Set becomes a “theory”, then interpretations also form such a “theory”, and these are often what we’re really interested in. The category of finite-product preserving interpretations of the theory of groups essentially is the category of groups. In fact, universal algebra is, in categorical terms, just the study of categories with finite products and finite-product preserving functors from them, particularly into Set. It’s easy to generalize this in many directions. It’s also easy to make very general definitions, like a general definition of a free algebraic structure. In general, we’re usually more interested in the interpretations of a theory than the theory itself.
While I often do advocate thinking in terms of internal languages of categories, I’m not sure that it is a preferable perspective for the very basics of category theory. Nevertheless, there are a few reasons for why I wrote this. First, this very syntactical approach is, I think, more accessible to someone coming from a programming background. From this view, a category is a very simple programming language. Adding structure to the category corresponds to adding features to this programming language. Interpretations are denotational semantics.
Another aspect about this presentation that is quite different is the use and emphasis on collages. Collages correspond to profunctors, a crucially important and enabling concept that is rarely covered in categorical introductions. The characterization of profunctors as collages in Vaughn Pratt’s paper (not using that name) was one of the things I enjoyed about that paper and part of what prompted me to start writing this. In earlier drafts of this article, I was unable to incorporate collages in a meaningful way as I was trying to start from profunctors. This approach just didn’t add value. Collages just looked like a bizarre curio and weren’t integrated into the narrative at all. For other reasons, though, I ended up revisiting the idea of a heteromorphism. My (fairly superficial) opinion is that once you have the notion of functors and natural transformations, adding the notion of heteromorphisms has a low power-to-weight ratio, though it does make some things a bit nicer. Nevertheless, in thinking of how best to fit them into this context, it was clear that collages provided the perfect mechanism (which isn’t a big surprise), and the result works rather nicely. When I realized a fact that can be cryptically but compactly represented as #ccK_ccT ≃ bbbI xx ccT# where #bbbI# is the interval category, i.e. two objects with a single arrow joining them, I realized that this is actually an interesting perspective. Since most of this article was written at that point, I wove collages into the narrative replacing some things. If, though, I had started with this perspective from the beginning I suspect I would have made a significantly different article, though the latter sections would likely be similar.
It’s actually better to organize this as a family of collections of function symbols indexed by pairs of sorts.↩
Instead of having equations that generate an equivalence relation on (raw) terms, we could simply require an equivalence relation on (raw) terms be directly provided.↩
Collaging is actually quite natural in this context. I already intend to support one theory importing another. A collage is just a theory that imports two others and then adds function symbols between them.↩
For programmers familiar with Agda, at least, if you haven’t made this connection, this might help you understand and appreciate what a “class” is versus a “set” and what “size issues” are, which is typically handled extremely vaguely in a lot of the literature.↩
I’ve been watching the Spring 2012 lectures for MIT 6.851 Advanced Data Structures with Prof. Erik Demaine. In lecture 12, “Fusion Trees”, it mentions a constant time algorithm for finding the index of the first most significant 1 bit in a word, i.e. the binary logarithm. Assuming word operations are constant time, i.e. in the Word RAM model, the below algorithm takes 27 word operations (not counting copying). When I compiled it with GHC 8.0.1 -O2 the core of the algorithm was 44 straight-line instructions. The theoretically interesting thing is, other than changing the constants, the same algorithm works for any word size that’s an even power of 2. Odd powers of two need a slight tweak. This is demonstrated for Word64
, Word32
, and Word16
. It should be possible to do this for any arbitrary word size w
.
The clz
instruction can be used to implement this function, but this is a potential simulation if that or a similar instruction wasn’t available. It’s probably not the fastest way. Similarly, find first set and count trailing zeros can be implemented in terms of this operation.
Below is the complete code. You can also download it here.
{-# LANGUAGE BangPatterns #-}
import Data.Word
import Data.Bits
-- Returns 0-based bit index of most significant bit that is 1. Assumes input is non-zero.
-- That is, 2^indexOfMostSignificant1 x <= x < 2^(indexOfMostSignificant1 x + 1)
-- From Erik Demaine's presentation in Spring 2012 lectures of MIT 6.851, particularly "Lecture 12: Fusion Trees".
-- Takes 26 (source-level) straight-line word operations.
indexOfMostSignificant1 :: Word64 -> Word64
indexOfMostSignificant1 w = idxMsbyte .|. idxMsbit
where
-- top bits of each byte
!wtbs = w .&. 0x8080808080808080
-- all but top bits of each byte producing 8 7-bit chunks
!wbbs = w .&. 0x7F7F7F7F7F7F7F7F
-- parallel compare of each 7-bit chunk to 0, top bit set in result if 7-bit chunk was not 0
!pc = parallelCompare 0x8080808080808080 wbbs
-- top bit of each byte set if the byte has any bits set in w
!ne = wtbs .|. pc
-- a summary of which bytes (except the first) are non-zero as a 7-bit bitfield, i.e. top bits collected into bottom byte
!summary = sketch ne `unsafeShiftR` 1
-- parallel compare summary to powers of two
!cmpp2 = parallelCompare 0xFFBF9F8F87838180 (0x0101010101010101 * summary)
-- index of most significant non-zero byte * 8
!idxMsbyte = sumTopBits8 cmpp2
-- most significant 7-bits of most significant non-zero byte
!msbyte = ((w `unsafeShiftR` (fromIntegral idxMsbyte)) .&. 0xFF) `unsafeShiftR` 1
-- parallel compare msbyte to powers of two
!cmpp2' = parallelCompare 0xFFBF9F8F87838180 (0x0101010101010101 * msbyte)
-- index of most significant non-zero bit in msbyte
!idxMsbit = sumTopBits cmpp2'
-- Maps top bits of each byte into lower byte assuming all other bits are 0.
-- 0x2040810204081 = sum [2^j | j <- map (\i -> 49 - 7*i) [0..7]]
-- In general if w = 2^(2*k+p) and p = 0 or 1 the formula is:
-- sum [2^j | j <- map (\i -> w-(2^k-1) - 2^(k+p) - (2^(k+p) - 1)*i) [0..2^k-1]]
-- Followed by shifting right by w - 2^k
sketch w = (w * 0x2040810204081) `unsafeShiftR` 56
parallelCompare w1 w2 = complement (w1 - w2) .&. 0x8080808080808080
sumTopBits w = ((w `unsafeShiftR` 7) * 0x0101010101010101) `unsafeShiftR` 56
sumTopBits8 w = ((w `unsafeShiftR` 7) * 0x0808080808080808) `unsafeShiftR` 56
indexOfMostSignificant1_w32 :: Word32 -> Word32
indexOfMostSignificant1_w32 w = idxMsbyte .|. idxMsbit
where !wtbs = w .&. 0x80808080
!wbbs = w .&. 0x7F7F7F7F
!pc = parallelCompare 0x80808080 wbbs
!ne = wtbs .|. pc
!summary = sketch ne `unsafeShiftR` 1
!cmpp2 = parallelCompare 0xFF838180 (0x01010101 * summary)
!idxMsbyte = sumTopBits8 cmpp2
!msbyte = ((w `unsafeShiftR` (fromIntegral idxMsbyte)) .&. 0xFF) `unsafeShiftR` 1
!cmpp2' = parallelCompare 0x87838180 (0x01010101 * msbyte)
-- extra step when w is not an even power of two
!cmpp2'' = parallelCompare 0xFFBF9F8F (0x01010101 * msbyte)
!idxMsbit = sumTopBits cmpp2' + sumTopBits cmpp2''
sketch w = (w * 0x204081) `unsafeShiftR` 28
parallelCompare w1 w2 = complement (w1 - w2) .&. 0x80808080
sumTopBits w = ((w `unsafeShiftR` 7) * 0x01010101) `unsafeShiftR` 24
sumTopBits8 w = ((w `unsafeShiftR` 7) * 0x08080808) `unsafeShiftR` 24
indexOfMostSignificant1_w16 :: Word16 -> Word16
indexOfMostSignificant1_w16 w = idxMsnibble .|. idxMsbit
where !wtbs = w .&. 0x8888
!wbbs = w .&. 0x7777
!pc = parallelCompare 0x8888 wbbs
!ne = wtbs .|. pc
!summary = sketch ne `unsafeShiftR` 1
!cmpp2 = parallelCompare 0xFB98 (0x1111 * summary)
!idxMsnibble = sumTopBits4 cmpp2
!msnibble = ((w `unsafeShiftR` (fromIntegral idxMsnibble)) .&. 0xF) `unsafeShiftR` 1
!cmpp2' = parallelCompare 0xFB98 (0x1111 * msnibble)
!idxMsbit = sumTopBits cmpp2'
sketch w = (w * 0x249) `unsafeShiftR` 12
parallelCompare w1 w2 = complement (w1 - w2) .&. 0x8888
sumTopBits w = ((w `unsafeShiftR` 3) * 0x1111) `unsafeShiftR` 12
sumTopBits4 w = ((w `unsafeShiftR` 3) * 0x4444) `unsafeShiftR` 12
Programmers in typed languages with higher order functions and algebraic data types are already comfortable with most of the basic constructions of set/type theory. In categorical terms, those programmers are familiar with finite products and coproducts and (monoidal/cartesian) closed structure. The main omissions are subset types (equalizers/pullbacks) and quotient types (coequalizers/pushouts) which would round out limits and colimits. Not having a good grasp on either of these constructions dramatically shrinks the world of mathematics that is understandable, but while subset types are fairly straightforward, quotient types are quite a bit less intuitive.
In my opinion, most programmers can more or less immediately understand the notion of a subset type at an intuitive level.
A subset type is just a type combined with a predicate on that type that specifies which values of the type we want. For example, we may have something like { n:Nat | n /= 0 }
meaning the type of naturals not equal to #0#. We may use this in the type of the division function for the denominator. Consuming a value of a subset type is easy, a natural not equal to #0# is still just a natural, and we can treat it as such. The difficult part is producing a value of a subset type. To do this, we must, of course, produce a value of the underlying type — Nat
in our example — but then we must further convince the type checker that the predicate holds (e.g. that the value does not equal #0#). Most languages provide no mechanism to prove potentially arbitrary facts about code, and this is why they do not support subset types. Dependently typed languages do provide such mechanisms and thus either have or can encode subset types. Outside of dependently typed languages the typical solution is to use an abstract data type and use a runtime check when values of that abstract data type are created.
The dual of subset types are quotient types. My impression is that this construction is the most difficult basic construction for people to understand. Further, programmers aren’t much better off, because they have little to which to connect the idea. Before I give a definition, I want to provide the example with which most people are familiar: modular (or clock) arithmetic. A typical way this is first presented is as a system where the numbers “wrap-around”. For example, in arithmetic mod #3#, we count #0#, #1#, #2#, and then wrap back around to #0#. Programmers are well aware that it’s not necessary to guarantee that an input to addition, subtraction, or multiplication mod #3# is either #0#, #1#, or #2#. Instead, the operation can be done and the mod
function can be applied at the end. This will give the same result as applying the mod
function to each argument at the beginning. For example, #4+7 = 11# and #11 mod 3 = 2#, and #4 mod 3 = 1# and #7 mod 3 = 1# and #1+1 = 2 = 11 mod 3#.
For mathematicians, the type of integers mod #n# is represented by the quotient type #ZZ//n ZZ#. The idea is that the values of #ZZ // n ZZ# are integers except that we agree that any two integers #a# and #b# are treated as equal if #a - b = kn# for some integer #k#. For #ZZ // 3 ZZ#, #… -6 = -3 = 0 = 3 = 6 = …# and #… = -5 = -2 = 1 = 4 = 7 = …# and #… = -4 = -1 = 2 = 5 = 8 = …#.
To start to formalize this, we need the notion of an equivalence relation. An equivalence relation is a binary relation #(~~)#
which is reflexive (#x ~~ x# for all #x#), symmetric (if #x ~~ y#
then #y ~~ x#
), and transitive (if #x ~~ y#
and #y ~~ z#
then #x ~~ z#
). We can check that “#a ~~ b# iff there exists an integer #k# such that #a-b = kn#” defines an equivalence relation on the integers for any given #n#. For reflexivity we have #a - a = 0n#. For symmetry we have if #a - b = kn# then #b - a = -kn#. Finally, for transitivity we have if #a - b = k_1 n# and #b - c = k_2 n# then #a - c = (k_1 + k_2)n# which we get by adding the preceding two equations.
Any relation can be extended to an equivalence relation. This is called the reflexive-, symmetric-, transitive-closure of the relation. For an arbitrary binary relation #R# we can define the equivalence relation #(~~_R)# via "#a ~~_R b# iff #a = b# or #R(a, b)# or #b ~~_R a# or #a ~~_R c and c ~~_R b# for some #c#". To be precise, #~~_R# is the smallest relation satisfying those constraints. In Datalog syntax, this looks like:
If #T# is a type, and #(~~)#
is an equivalence relation, we use #T // ~~# as the notation for the quotient type, which we read as “#T# quotiented by the equivalence relation #(~~)#
”. We call #T# the underlying type of the quotient type. We then say #a = b# at type #T // ~~# iff #a ~~ b#. Dual to subset types, to produce a value of a quotient type is easy. Any value of the underlying type is a value of the quotient type. (In type theory, this produces the perhaps surprising result that #ZZ# is a subtype of #ZZ // n ZZ#.) As expected, consuming a value of a quotient type is more complicated. To explain this, we need to explain what a function #f : T // ~~ -> X# is for some type #X#. A function #f : T // ~~ -> X# is a function #g : T -> X# which satisfies #g(a) = g(b)# for all #a# and #b# such that #a ~~ b#. We call #f# (or #g#, they are often conflated) well-defined if #g# satisfies this condition. In other words, any well-defined function that consumes a quotient type isn’t allowed to produce an output that distinguishes between equivalent inputs. A better way to understand this is that quotient types allow us to change what the notion of equality is for a type. From this perspective, a function being well-defined just means that it is a function. Taking equal inputs to equal outputs is one of the defining characteristics of a function.
Sometimes we can finesse needing to check the side condition. Any function #h : T -> B# gives rise to an equivalence relation on #T# via #a ~~ b# iff #h(a) = h(b)#. In this case, any function #g : B -> X# gives rise to a function #f : T // ~~ -> X# via #f = g @ h#. In particular, when #B = T# we are guaranteed to have a suitable #g# for any function #f : T // ~~ -> X#. In this case, we can implement quotient types in a manner quite similar subset types, namely we make an abstract type and we normalize with the #h# function as we either produce or consume values of the abstract type. A common example of this is rational numbers. We can reduce a rational number to lowest terms either when it’s produced or when the numerator or denominator get accessed, so that we don’t accidentally write functions which distinguish between #1/2# and #2/4#. For modular arithmetic, the mod by #n# function is a suitable #h#.
In set theory such an #h# function can always be made by mapping the elements of #T# to the equivalence classes that contain them, i.e. #a# gets mapped to #{b | a ~~ b}# which is called the equivalence class of #a#. In fact, in set theory, #T // ~~# is usually defined to be the set of equivalence classes of #(~~)#
. So, for the example of #ZZ // 3 ZZ#, in set theory, it is a set of exactly three elements: the elements are #{ 3n+k | n in ZZ}# for #k = 0, 1, 2#. Equivalence classes are also called partitions and are said to partition the underlying set. Elements of these equivalence classes are called representatives of the equivalence class. Often a notation like #[a]# is used for the equivalence class of #a#.
Here is a quick run-through of some significant applications of quotient types. I’ll give the underlying type and the equivalence relation and what the quotient type produces. I’ll leave it as an exercise to verify that the equivalence relations really are equivalence relations, i.e. reflexive, symmetric, and transitive. I’ll start with more basic examples. You should work through them to be sure you understand how they work.
Integers can be presented as pairs of naturals #(n, m)# with the idea being that the pair represents “#n - m#”. Of course, #1 - 2# should be the same as #2 - 3#. This is expressed as #(n_1, m_1) ~~ (n_2, m_2)# iff #n_1 + m_2 = n_2 + m_1#. Note how this definition only relies on operations on natural numbers. You can explore how to define addition, subtraction, multiplication, and other operations on this representation in a well-defined manner.
Rationals can be presented very similarly to integers, only with multiplication instead of addition. We also have pairs #(n, d)#, usually written #n/d#, in this case of an integer #n# and a non-zero natural #d#. The equivalence relation is #(n_1, d_1) ~~ (n_2, d_2)# iff #n_1 d_2 = n_2 d_1#.
We can extend the integers mod #n# to the continuous case. Consider the real numbers with the equivalence relation #r ~~ s# iff #r - s = k# for some integer #k#. You could call this the reals mod #1#. Topologically, this is a circle. If you walk along it far enough, you end up back at a point equivalent to where you started. Occasionally this is written as #RR//ZZ#.
Doing the previous example in 2D gives a torus. Specifically, we have pairs of real numbers and the equivalence relation #(x_1, y_1) ~~ (x_2, y_2)# iff #x_1 - x_2 = k# and #y_1 - y_2 = l# for some integers #k# and #l#. Quite a bit of topology relies on similar constructions as will be expanded upon on the section on gluing.
Here’s an example that’s a bit closer to programming. Consider the following equivalence relation on arbitrary pairs: #(a_1, b_1) ~~ (a_2, b_2)# iff #a_1 = a_2 and b_1 = b_2# or #a_1 = b_2 and b_1 = a_2#. This just says that a pair is equivalent to either itself, or a swapped version of itself. It’s interesting to consider what a well-defined function is on this type.^{1}
Returning to topology and doing a bit more involved construction, we arrive at gluing or pushouts. In topology, we often want to take two topological spaces and glue them together in some specified way. For example, we may want to take two discs and glue their boundaries together. This gives a sphere. We can combine two spaces into one with the disjoint sum (or coproduct, i.e. Haskell’s Either
type.) This produces a space that contains both the input spaces, but they don’t interact in any way. You can visualize them as sitting next to each other but not touching. We now want to say that certain pairs of points, one from each of the spaces, are really the same point. That is, we want to quotient by an equivalence relation that would identify those points. We need some mechanism to specify which points we want to identify. One way to accomplish this is to have a pair of functions, #f : C -> A# and #g : C -> B#, where #A# and #B# are the spaces we want to glue together. We can then define a relation #R# on the disjoint sum via #R(a, b)# iff there’s a #c : C# such that #a = tt "inl"(f(c)) and b = tt "inr"(g(c))#
. This is not an equivalence relation, but we can extend it to one. The quotient we get is then the gluing of #A# and #B# specified by #C# (or really by #f# and #g#). For our example of two discs, #f# and #g# are the same function, namely the inclusion of the boundary of the disc into the disc. We can also glue a space to itself. Just drop the disjoint sum part. Indeed, the circle and torus are examples.
We write #RR[X]# for the type of polynomials with one indeterminate #X# with real coefficients. For two indeterminates, we write #RR[X, Y]#. Values of these types are just polynomials such as #X^2 + 1# or #X^2 + Y^2#. We can consider quotienting these types by equivalence relations generated from identifications like #X^2 + 1 ~~ 0# or #X^2 - Y ~~ 0#, but we want more than just the reflexive-, symmetric-, transitive-closure. We want this equivalence relation to also respect the operations we have on polynomials, in particular, addition and multiplication. More precisely, we want if #a ~~ b# and #c ~~ d# then #ac ~~ bd# and similarly for addition. An equivalence relation that respects all operations is called a congruence. The standard notation for the quotient of #RR[X, Y]# by a congruence generated by both of the previous identifications is #RR[X, Y]//(X^2 + 1, X^2 - Y)#. Now if #X^2 + 1 = 0# in #RR[X, Y]//(X^2 + 1, X^2 - Y)#, then for any polynomial #P(X, Y)#, we have #P(X, Y)(X^2 + 1) = 0# because #0# times anything is #0#. Similarly, for any polynomial #Q(X, Y)#, #Q(X, Y)(X^2 - Y) = 0#. Of course, #0 + 0 = 0#, so it must be the case that #P(X, Y)(X^2 + 1) + Q(X, Y)(X^2 - Y) = 0# for all polynomials #P# and #Q#. In fact, we can show that all elements in the equivalence class of #0# are of this form. You’ve now motivated the concrete definition of a ring ideal and given it’s significance. An ideal is an equivalence class of #0# with respect to some congruence. Let’s work out what #RR[X, Y]//(X^2 + 1, X^2 - Y)# looks like concretely. First, since #X^2 - Y = 0#, we have #Y = X^2# and so we see that values of #RR[X, Y]//(X^2 + 1, X^2 - Y)# will be polynomials in only one indeterminate because we can replace all #Y#s with #X^2#s. Since #X^2 = -1#, we can see that all those polynomials will be linear (i.e. of degree 1) because we can just keep replacing #X^2#s with #-1#s, i.e. #X^(n+2) = X^n X^2 = -X^n#. The end result is that an arbitrary polynomial in #RR[X, Y]//(X^2 + 1, X^2 - Y)# looks like #a + bX# for real numbers #a# and #b# and we have #X^2 = -1#. In other words, #RR[X, Y]//(X^2 + 1, X^2 - Y)# is isomorphic to the complex numbers, #CC#.
As a reasonably simple exercise, given a polynomial #P(X) : RR[X]#, what does it get mapped to when embedded into #RR[X]//(X - 3)#, i.e. what is #[P(X)] : RR[X]//(X - 3)#?^{2}
Moving much closer to programming, we have a rather broad and important example that a mathematician might describe as free algebras modulo an equational theory. This example covers several of the preceding examples. In programmer-speak, a free algebra is just a type of abstract syntax trees for some language. We’ll call a specific abstract syntax tree a term. An equational theory is just a collection of pairs of terms with the idea being that we’d like these terms to be considered equal. To be a bit more precise, we will actually allow terms to contain (meta)variables. An example equation for an expression language might be Add(
#x#,
#x#) = Mul(2,
#x#)
. We call a term with no variables a ground term. We say a ground term matches another term if there is a consistent substitution for the variables that makes the latter term syntactically equal to the ground term. E.g. Add(3, 3)
matches Add(
#x#,
#x#)
via the substitution #x |->#3
. Now, the equations of our equational theory gives rise to a relation on ground terms #R(t_1, t_2)# iff there exists an equation #l = r# such that #t_1# matches #l# and #t_2# matches #r#. This relation can be extended to an equivalence relation on ground terms, and we can then quotient by that equivalence relation.
Let’s consider a worked example. We can consider the theory of monoids. We have two operations (types of AST nodes): Mul(
#x#,
#y#)
and 1
. We have the following three equations: Mul(1,
#x#) =
#x#, Mul(
#x#, 1) =
#x#, and Mul(Mul(
#x#,
#y#),
#z#) = Mul(
#x#, Mul(
#y#,
#z#))
. We additionally have a bunch of constants subject to no equations. In this case, it turns out we can define a normalization function, what I called #h# far above, and that the quotient type is isomorphic to lists of constants. Now, we can extend this theory to the theory of groups by adding a new operation, Inv(
#x#)
, and new equations: Inv(Inv(
#x#)) =
#x#, Inv(Mul(
#x#,
#y#)) = Mul(Inv(
#y#), Inv(
#x#))
, and Mul(Inv(
#x#),
#x#) = 1
. If we ignore the last of these equations, you can show that we can normalize to a form that is isomorphic to a list of a disjoint sum of the constants, i.e. [Either Const Const]
in Haskell if Const
were the type of the constant terms. Quotienting this type by the equivalence relation extended with that final equality corresponds to adding the rule that a Left c
cancels out Right c
in the list whenever they are adjacent.
This overall example is a fairly profound one. Almost all of abstract algebra can be viewed as an instance of this or a closely related variation. When you hear about things defined in terms of “generators and relators”, it is an example of this sort. Indeed, those “relators” are used to define a relation that will be extended to an equivalence relation. Being defined in this way is arguably what it means for something to be “algebraic”.
The Introduction to Type Theory section of the NuPRL book provides a more comprehensive and somewhat more formal presentation of these and related concepts. While the quotient type view of quotients is conceptually different from the standard set theoretic presentation, it is much more amenable to computation as the #ZZ // n ZZ# example begins to illustrate.
I don’t believe classical logic is false; I just believe that it is not true.
]]>Knockout is a nice JavaScript library for making values that automatically update when any of their “dependencies” update. Those dependencies can form an arbitrary directed acyclic graph. Many people seem to think of it as “yet another” templating library, but the core idea which is useful far beyond “templating” is the notion of observable values. One nice aspect is that it is a library and not a framework so you can use it as little or as much as you want and you can integrate it with other libraries and frameworks.
At any rate, this article is more geared toward those who have already decided on using Knockout or a library (in any language) offering similar capabilities. I strongly suspect the issues and solutions I’ll discuss apply to all similar sorts of libraries. While I’ll focus on one particular example, the ideas behind it apply generally. This example, admittedly, is one that almost anyone will implement, and in my experience will do it incorrectly the first time and won’t realize the problem until later.
When doing any front-end work, before long there will be a requirement to support “multi-select” of something. Of course, you want the standard select/deselect all functionality and for it to work correctly, and of course you want to do something with the items you’ve selected. Here’s a very simple example:
Item | |
---|---|
Here, the number selected is an overly simple example of using the selected items. More realistically, the selected items will trigger other items to show up and/or trigger AJAX requests to update the data or populate other data. The HTML for this example is completely straightforward.
<div id="#badExample">
Number selected: <span data-bind="text: $data.numberSelected()"></span>
<table>
<tr><th><input type="checkbox" data-bind="checked: $data.allSelected"/></th><th>Item</th></tr>
<!-- ko foreach: { data: $data.items(), as: '$item' } -->
<tr><td><input type="checkbox" data-bind="checked: $data.selected"/></td><td data-bind="text: 'Item number: '+$data.body"></td></tr>
<!-- /ko -->
<tr><td><button data-bind="click: function() { $data.add(); }">Add</button></td></tr>
</table>
</div>
The way nearly everyone (including me) first thinks to implement this is by adding a selected
observable to each item and then having allSelected
depend on all of the selected
observables. Since we also want to write to allSelected
to change the state of the selected
observables we use a writable computed observable. This computed observable will loop through all the items and check to see if they are all set to determine it’s state. When it is updated, it will loop through all the selected
observables and set them to the appropriate state. Here’s the full code listing.
var badViewModel = {
counter: 0,
items: ko.observableArray()
};
badViewModel.allSelected = ko.computed({
read: function() {
var items = badViewModel.items();
var allSelected = true;
for(var i = 0; i < items.length; i++) { // Need to make sure we depend on each item, so don't break out of loop early
allSelected = allSelected && items[i].selected();
}
return allSelected;
},
write: function(newValue) {
var items = badViewModel.items();
for(var i = 0; i < items.length; i++) {
items[i].selected(newValue);
}
}
});
badViewModel.numberSelected = ko.computed(function() {
var count = 0;
var items = badViewModel.items();
for(var i = 0; i < items.length; i++) {
if(items[i].selected()) count++;
}
return count;
});
badViewModel.add = function() {
badViewModel.items.push({
body: badViewModel.counter++,
selected: ko.observable(false)
});
};
ko.applyBindings(badViewModel, document.getElementById('#badExample'));
This should be relatively straightforward, and it works, so what’s the problem? The problem can be seen in numberSelected
(and it also comes up with allSelected
which I’ll get to momentarily). numberSelected
depends on each selected
observable and so it will be fired each time each one updates. That means if you have 100 items, and you use the select all checkbox, numberSelected
will be called 100 times. For this example, that doesn’t really matter. For a more realistic example than numberSelected
, this may mean rendering one, then two, then three, … then 100 HTML fragments or making 100 AJAX requests. In fact, this same behavior is present in allSelected
. When it is written, as it’s writing to the selected
observables, it is also triggering itself.
So the problem is updating allSelected
or numberSelected
can’t be done all at once, or to use database terminology, it can’t be updated atomically. One possible solution in newer versions of Knockout is to use deferredUpdates
or, what I did back in the much earlier versions of Knockout, abuse the rate limiting features. The problem with this solution is that it makes updates asynchronous. If you’ve written your code to not care whether it was called synchronously or asynchronously, then this will work fine. If you haven’t, doing this throws you into a world of shared state concurrency and race conditions. In this case, this solution is far worse than the disease.
So, what’s the alternative? We want to update all selected items atomically; we can atomically update a single observable; so we’ll put all selected items into a single observable. Now an item determines if it is selected by checking whether it is in the collection of selected items. More abstractly, we make our observables more coarse-grained, and we have a bunch of small computed observables depend on a large observable instead of a large computed observable depending on a bunch of small observables as we had in the previous code. Here’s an example using the exact same HTML and presenting the same overt behavior.
Item | |
---|---|
And here’s the code behind this second example:
var goodViewModel = {
counter: 0,
selectedItems: ko.observableArray(),
items: ko.observableArray()
};
goodViewModel.allSelected = ko.computed({
read: function() {
return goodViewModel.items().length === goodViewModel.selectedItems().length;
},
write: function(newValue) {
if(newValue) {
goodViewModel.selectedItems(goodViewModel.items().slice(0)); // Need a copy!
} else {
goodViewModel.selectedItems.removeAll();
}
}
});
goodViewModel.numberSelected = ko.computed(function() {
return goodViewModel.selectedItems().length;
});
goodViewModel.add = function() {
var item = { body: goodViewModel.counter++ }
item.selected = ko.computed({
read: function() {
return goodViewModel.selectedItems.indexOf(item) > -1;
},
write: function(newValue) {
if(newValue) {
goodViewModel.selectedItems.push(item);
} else {
goodViewModel.selectedItems.remove(item);
}
}
});
goodViewModel.items.push(item);
};
ko.applyBindings(goodViewModel, document.getElementById('#goodExample'));
One thing to note is that setting allSelected
and numberSelected
are now both simple operations. A write to an observable triggers a constant number of writes to other observables. In fact, there are only two (non-computed) observables. On the other hand, reading the selected
observable is more expensive. Toggling all items has quadratic complexity. In fact, it had quadratic complexity before due to the feedback. However, unlike the previous code, this also has quadratic complexity when any individual item is toggled. Unlike the previous code, though, this is simply due to a poor choice of data structure. Equipping each item with an “ID” field and using an object as a hash map would reduce the complexity to linear. In practice, for this sort of scenario, it tends not to make a big difference. Also, Knockout won’t trigger dependents if the value doesn’t change, so there’s no risk of the extra work propagating into still more extra work. Nevertheless, while I endorse this solution for this particular problem, in general making finer grained observables can help limit the scope of changes so unnecessary work isn’t done.
Still, the real concern and benefit of this latter approach isn’t the asymptotic complexity of the operations, but the atomicity of the operations. In the second solution, every update is atomic. There are no intermediate states on the way to a final state. This means that dependents, represented by numberSelected
but which are realistically much more complicated, don’t get triggered excessively and don’t need to “compensate” for unintended intermediate values.
We could take the coarse-graining to its logical conclusion and have the view model for an application be a single observable holding an object representing the entire view model (and containing no observables of its own). Taking this approach actually does have a lot of benefits, albeit there is little reason to use Knockout at that point. Instead this starts to lead to things like Facebook’s Flux pattern and the pattern perhaps most clearly articulated by Cycle JS.
]]>For many people interested in type systems and type theory, their first encounter with the literature presents them with this:
#frac(Gamma,x:tau_1 |--_Sigma e : tau_2)(Gamma |--_Sigma (lambda x:tau_1.e) : tau_1 -> tau_2) ->I#
#frac(Gamma |--_Sigma f : tau_1 -> tau_2 \qquad Gamma |--_Sigma x : tau_1)(Gamma |--_Sigma f x : tau_2) ->E#
Since this notation is ubiquitous, authors (reasonably) expect readers to already be familiar with it and thus provide no explanation. Because the notation is ubiquitous, the beginner looking for alternate resources will not escape it. All they will find is that the notation is everywhere but exists in myriad minor variations which may or may not indicate significant differences. At this point the options are: 1) to muddle on and hope understanding the notation isn’t too important, 2) look for introductory resources which typically take the form of $50+ 500+ page textbooks, or 3) give up.
The goal of this article is to explain the notation part-by-part in common realizations, and to cover the main idea behind the notation which is the idea of an inductively defined relation. To eliminate ambiguity and make hand-waving impossible, I’ll ground the explanations in code, in particular, in Agda. That means for each example of the informal notation, there will be how it would be realized in Agda.^{1} It will become clear that I’m am not (just) using Agda as a formal notation to talk about these concepts, but that Agda’s^{2} data type mechanism directly captures them^{3}. The significance of this is that programmers are already familiar with many of the ideas behind the informal notation, and the notation is just obscuring this familiarity. Admittedly, Agda is itself pretty intimidating. I hope most of this article is accessible to those with familiarity with algebraic data types as they appear in Haskell, ML, Rust, or Swift with little to no need to look up details about Agda. Nevertheless, Agda at least has the benefit, when compared to the informal notation, of having a clear place to go to learn more, an unambiguous meaning, and tools that allow playing around with the ideas.
To start, if you are not already familiar with it, get familiar with the Greek alphabet. It will be far easier to (mentally) read mathematical notation of any kind if you can say “Gamma x” rather than “right angle thingy x” or “upside-down L x”.
Using the example from the introduction, the whole thing is a rule. The “|->I|” part is just the name of the rule (in this case being short for “|->| Introduction”). This rule is only part of the definition of the judgment of the form:
#Gamma |--_Sigma e : tau#
The judgment can be viewed as a proposition and the rule is an “if-then” statement read from top to bottom. So the “|->I|” rule says, “if #Gamma, x : tau_1 |--_Sigma e : tau_2#
then #Gamma |--_Sigma lambda x : tau_1.e : tau_1 -> tau_2#
”. It is often profitable to read it bottom-up as “To prove #Gamma |--_Sigma lambda x : tau_1.e : tau_1 -> tau_2#
you need to show #Gamma, x : tau_1 |--_Sigma e : tau_2#
”.
So what is the judgment saying? First, the judgment is, in this case, a four argument relation. The arguments of this relation are #Gamma#, #Sigma#, #e#, and #tau#. We could say the name of this relation is the perspicuous #(_)|--_((_)) (_) : (_)#
. Note that it does not make sense to ask what “⊢” means or what “:” means anymore than it makes sense to ask what “->” means in Haskell’s \ x -> e
.^{4}
In the context of type systems, #Gamma# is called the context, #Sigma# is called the signature, #e# is the term or expression, and #tau# is the type. Given this, I would read #Gamma |--_Sigma e : tau#
as “the expression e has type tau in context gamma given signature sigma.” For the “#->E#” rule we have, additionally, multiple judgements above the line. These are joined together by conjunction, that is, we’d read “#->E#” as "if #Gamma |--_Sigma f : tau_1 -> tau_2#
and #Gamma |--_Sigma x : tau_1#
then #Gamma |--_Sigma f x : tau_2#
In most recent type system research multiple judgments are necessary to describe the type system, and so you may see things like #Gamma |-- e > tau#
or #Gamma |-- e_1 "~" e_2#
. The key thing to remember is that these are completely distinct relations that will have their own definitions (although the collection of them will often be mutually recursively defined).
Relations in set theory are boolean valued functions. Being programmers, and thus constructivists, we want evidence, so a relation |R : A xx B -> bb2| becomes a type constructor R : (A , B) -> Set
. |R(a,b)| holds if we have a value (proof/witness) w : R a b
. An inductively defined relation or judgment is then just a type constructor for an (inductive) data type. That means, if R
is an inductively defined relation, then its definition is data R : A -> B -> Set where ...
. A rule is a constructor of this data type. A derivation is a value of this data type, and will usually be a tree-like structure. As a bit of ambiguity in the terminology (arguably arising from a common ambiguity in mathematical notation), it’s a bit more natural to use the term “judgment” to refer to something that can be (at the meta level) true or false. For example, we’d say |R(a,b)| is a judgment. Nevertheless, when we say something like “the typing judgment” it’s clear that we’re referring to the whole relation, i.e. |R|.
Since a judgment is a relation, we need to describe what the arguments to the relation look like. Typically something like BNF is used. The BNF definitions provide the types used as parameters to the judgments. It is common to use a Fortran-esque style where a naming convention is used to avoid the need to explicitly declare the types of meta-variables. For example, the following says meta-variables like #n#, #m#, and #n_1# are all natural numbers.
n, m ::= Z | S n
BNF definitions translate readily to algebraic data types.
Agda note:
Set
is what is called*
in Haskell. “Type” would be a better name. Also, these sidebars will cover details about Agda with the aim that readers unfamiliar with Agda don’t get tripped up by tangential details.
Sometimes it’s not possible to fully capture the constraints on well-formed syntax with BNF. In other words, only a subset of syntactically valid terms are well-formed. For example, Nat Nat
is syntactically valid but is not well-formed. We can pick out that subset with a predicate, i.e. a unary relation. This is, of course, nothing but another judgment. As an example, if we wired the Maybe
type into our type system, we’d likely have a judgment that looks like #tau\ tt"type"#
which would include the following rule:
#frac(tau\ tt"type")(("Maybe"\ tau)\ tt"type")#
In a scenario like this, we’d also have to make sure the rules of our typing judgment also required the types involved to be well-formed. Modifying the example from the introduction, we’d get:
#frac(Gamma,x:tau_1 |--_Sigma e : tau_2 \qquad tau_1\ tt"type" \qquad tau_2\ tt"type")(Gamma |--_Sigma (lambda x:tau_1.e) : tau_1 -> tau_2) ->I#
As a very simple example, let’s say we wanted to provide explicit evidence that one natural number was less than or equal to another in Agda. Scenarios like this are common in dependently typed programming, and so we’ll start with the Agda this time and then “informalize” it.
data _isLessThanOrEqualTo_ : Nat -> Nat -> Set where
Z<=n : {n : Nat} -> Z isLessThanOrEqualTo n
Sm<=Sn : {m : Nat} -> {n : Nat} -> m isLessThanOrEqualTo n -> (S m) isLessThanOrEqualTo (S n)
Agda notes: In Agda identifiers can contain almost any character so
Z<=n
is just an identifier. Agda allows any identifier to be used infix (or more generally mixfix). The underscores mark where the arguments go. So_isLessThanOrEqualTo_
is a binary infix operator. Finally, curly brackets indicate implicit arguments which can be omitted and Agda will “guess” their values. Usually, they’ll be obvious to Agda by unification.
In the informal notation, the types of the arguments are implied by the naming. n
is a natural number because it was used as the metavariable (non-terminal) in the BNF for naturals. We also implicitly quantify over all free variables. In the Agda code, this quantification was explicit.
#frac()(Z <= n) tt"Z<=n"#
#frac(m <= n)(S m <= S n) tt"Sm<=Sn"#
Again, I want to emphasize that these are defining isLessThanOrEqualTo
and |<=|. They can’t be wrong. They can only fail to coincide with our intuitions or to an alternate definition. A derivation that |2 <= 3| looks like:
In Agda:
twoIsLessThanThree : (S (S Z)) isLessThanOrEqualTo (S (S (S Z)))
twoIsLessThanThree = Sm<=Sn (Sm<=Sn Z<=n)
In the informal notation:
#frac(frac()(Z <= S Z))(frac(S Z <= S (S Z))(S (S Z) <= S (S (S Z)))#
Here’s a larger example that also illustrates that these judgments do not need to be typing judgments. Here we’re defining a big-step operational semantics for the untyped lambda calculus.
x variable
v ::= λx.e
e ::= v | e e | x
In informal presentations, binders like #lambda# are handled in a fairly relaxed manner. While the details of handling binders are tricky and error-prone, they are usually standard and so authors assume readers can fill in those details and are aware of the concerns (e.g. variable capture). In Agda, of course, we’ll need to spell out the details. There are many approaches for dealing with binders with different trade-offs. One of the newer and more convenient approaches is parametric higher-order abstract syntax (PHOAS). Higher-order abstract syntax (HOAS) approaches allow us to reuse the binding structure of the host language and thus eliminate much of the work. Below, this is realized by the Lambda
constructor taking a function as its argument. In a later section, I’ll use a different approach using deBruijn indices.
-- PHOAS approach to binding
mutual
data Expr (A : Set) : Set where
Val : Value A -> Expr A
App : Expr A -> Expr A -> Expr A
Var : A -> Expr A
data Value (A : Set) : Set where
Lambda : (A -> Expr A) -> Value A
-- A closed expression
CExpr : Set1
CExpr = {A : Set} -> Expr A
-- A closed expression that is a value
CValue : Set1
CValue = {A : Set} -> Value A
Agda note:
Set1
is needed for technical reasons that are unimportant. You can just pretend it saysSet
instead. More important is that the definitions ofExpr
andValue
are a bit different than the definition for_isLessThanOrEqualTo_
. In particular, the argument(A : Set)
occurs to the left of the colon. When an argument occurs to the left of the colon we say it parameterizes the data declaration and that it is a parameter. When it occurs to the right of the colon we say it indexes the data declaration and that it is an index. The difference is that parameters must occur uniformly in the return type of the data constructors while indexes can be different in each data constructor. The arguments of an inductively defined relation like_isLessThanOrEqualTo_
will always be indexes (though there could be additional parameters.)
#frac(e_1 darr lambda x.e \qquad e_2 darr v_2 \qquad e[x|->v_2] darr v)(e_1 e_2 darr v) tt"App"#
#frac()(v darr v) tt"Trivial"#
The #e darr v# judgment (read as “the expression #e# evaluates to the value #v#”) defines a call-by-value evaluation relation. #e[x|->v]# means “substitute #v# for #x# in the expression #e#”. This notation is not standardized; there are many variants. In more rigorous presentations this operation will be formally defined, but usually the authors assume you are familiar with it. In the #tt"Trivial"#
rule, the inclusion of values into expressions is implicitly used. Note that the rule is restricted to values only.
The #tt"App"#
rule specifies call-by-value because the #e_2# expression is evaluated and then the resulting value is substituted into #e#. For call-by-name, we’d omit the evaluation of #e_2# and directly substitute #e_2# for #x# in #e#. Whether #e_1# or #e_2# is evaluated first (or in parallel) is not specified in this example.
subst : {A : Set} -> Expr (Expr A) -> Expr A
subst (Var e) = e
subst (Val (Lambda b)) = Val (Lambda (λ a -> subst (b (Var a))))
subst (App e1 e2) = App (subst e1) (subst e2)
data _EvaluatesTo_ : CExpr -> CValue -> Set1 where
EvaluateTrivial : {v : CValue} -> (Val v) EvaluatesTo v
EvaluateApp : {e1 : CExpr} -> {e2 : CExpr}
-> {e : {A : Set} -> A -> Expr A}
-> {v2 : CValue} -> {v : CValue}
-> e1 EvaluatesTo (Lambda e)
-> e2 EvaluatesTo v2
-> (subst (e (Val v2))) EvaluatesTo v
-> (App e1 e2) EvaluatesTo v
The EvaluateTrivial
constructor explicitly uses the Val
injection of values into expressions. The EvaluateApp
constructor starts off with a series of implicit arguments that introduce and quantify over the variables used in the rule. After those, each judgement above the line in the #tt"App"#
rule, becomes an argument to the EvaluateApp
constructor.
In this case ↓ is defining a functional relation, meaning for every expression there’s at most one value that the expression evaluates to. So another natural way to interpret ↓ is as a definition, in logic programming style, of a (partial) recursive function. In other words we can use the concept of mode from logic programming and instead of treating the arguments to ↓ as inputs, we can treat the first as an input and the second as an output.
↓ gives rise to a partial function because not every expression has a normal form. For _EvaluatesTo_
this is realized by the fact that we simply won’t be able to construct a term of type e EvaluatesTo v
for any v
if e
doesn’t have a normal form. In fact, we can use the inductive structure of the relationship to help prove that statement. (Unfortunately, Agda doesn’t present a very good experience for data types indexed by functions, so the proof is not nearly as smooth as one would like.)
Next we’ll turn to type systems which will present an even larger example, and will introduce some concepts that are specific to type systems (though, of course, they overlap greatly with concepts in logic due to the Curry-Howard correspondence.)
Below is an informal presentation of the polymorphic lambda calculus with explicit type abstraction and type application. An interesting fact about the polymorphic lambda calculus is that we don’t need any base types. Via Church-encoding, we can define types like natural numbers and lists.
α type variable
τ ::= τ → τ | ∀α. τ | α
x variable
c constant
v ::= λx:τ.e | Λτ.e | c
e ::= v | e e | e[τ] | x
In this case I’ll be using deBruijn indices to handle the binding structure of the terms and types. This means instead of writing |lambda x.lambda y. x|
, you would write |lambda lambda 1|
where the |1| counts how many binders (lambdas) you need to traverse to reach the binding site.
data TType : Set where
TTVar : Nat -> TType -- α
_=>_ : TType -> TType -> TType -- τ → τ
Forall : TType -> TType -- ∀α. τ
mutual
data TExpr : Set where
TVal : TValue -> TExpr -- v
TApp : TExpr -> TExpr -> TExpr -- f x
TTyApp : TExpr -> TType -> TExpr -- e[τ]
TVar : Nat -> TExpr -- x
data TValue : Set where
TLambda : TType -> TExpr -> TValue -- λx:τ.e
TTyLambda : TExpr -> TValue -- Λτ.e
TConst : Nat -> TValue -- c
In formulating the typing rules we need to deal with open terms, that is terms which refer to variables that they don’t bind. This should only happen if some enclosing terms did bind those variables, so we need to keep track of the variables that have been bound by enclosing terms. For example, when type checking |lambda x:tau.x|
, we’ll need to type check the subterm |x| which does not contain enough information in itself for us to know what the type should be. So, we keep track of what variables have been bound (and to what type) in a context and then we can just look up the expected type. When authors bother formally spelling out the context, it will look something like the following:
Γ ::= . | Γ, x:τ
Δ ::= . | Δ, α
We see that this is just a (snoc) list. In the first case, |Gamma|, it is a list of pairs of variables and types, i.e. an association list mapping variables to types. Often it will be treated as a finite mapping. In the second case, |Delta|, it is a list of type variables. Since I’m using deBruijn notation, there are no variables so we end up with a list of types in the first case. In the second case, we would end up with a list of nothing in particular, i.e. a list of unit, but that is isomorphic to a natural number. In other words, the only purpose of the type context, |Delta|, is to make sure we don’t use unbound variables, which in deBruijn notation just means we don’t have deBruijn indexes that try to traverse more lambdas than enclose them. The Agda code for the above is completely straight-forward.
data List (A : Set) : Set where
Nil : List A
_,_ : List A -> A -> List A
Context : Set
Context = List TType
TypeContext : Set
TypeContext = Nat
Signatures keep track of what primitive, “user-defined” constants might exist. Often the signature is omitted since nothing particularly interesting happens with it. Indeed, that will be the case for us. Nevertheless, we see that the signature is just another association list mapping constants to types.
Σ ::= . | Σ, c:τ
The main reason I included the signature, beyond just covering it for the cases when it is included, is that sometimes certain rules can be better understood as manipulations of the signature. For example, in logic, universal quantification is often described by a rule like:
#frac(Gamma |-- P[x|->c] \qquad c\ "fresh")(Gamma |-- forall x.P)#
What’s happening and what “freshness” is is made a bit clearer by employing a signature (which for logic is usually just a list of constants similar to our TypeContext
):
#frac(Gamma |--_(Sigma, c) P[x|->c] \qquad c notin Sigma)(Gamma |--_Sigma forall x.P)#
To define the typing rules we need two judgements. The first, #Delta |-- tau#
, will be a simple judgement that says |tau| is a well formed type in |Delta|. This basically just requires that all variables are bound.
#frac(alpha in Delta)(Delta |-- alpha)#
#frac(Delta, alpha |-- tau)(Delta |-- forall alpha. tau)#
#frac(Delta |-- tau_1 \qquad Delta |-- tau_2)(Delta |-- tau_1 -> tau_2)#
The Agda is
data _<_ : Nat -> Nat -> Set where
Z<Sn : {n : Nat} -> Z < S n
Sn<SSm : {n m : Nat} -> n < S m -> S n < S (S m)
data _isValidIn_ : TType -> TypeContext -> Set where
TyVarJ : {n : Nat} -> {ctx : TypeContext} -> n < ctx -> (TTVar n) isValidIn ctx
TyArrJ : {t1 t2 : TType} -> {ctx : TypeContext} -> t1 isValidIn ctx -> t2 isValidIn ctx -> (t1 => t2) isValidIn ctx
TyForallJ : {t : TType} -> {ctx : TypeContext} -> t isValidIn (S ctx) -> (Forall t) isValidIn ctx
The meat is the following typing judgement, depending on the judgement defining well-formed types. I’m not really going to explain these rules because, in some sense, there is nothing to explain. Beyond explaining the notation itself, which was the point of the article, the below is “self-explanatory” in the sense that it is a definition, and whether it is a good definition or “meaningful” depends on whether we can prove the theorems we want about it.
#frac(c:tau in Sigma \qquad Delta |-- tau)(Delta;Gamma |--_Sigma c : tau) tt"Const"#
#frac(x:tau in Gamma \qquad Delta |-- tau)(Delta;Gamma |--_Sigma x : tau) tt"Var"#
#frac(Delta;Gamma |--_Sigma e_1 : tau_1 -> tau_2 \qquad Delta;Gamma |--_Sigma e_2 : tau_1)(Delta;Gamma |--_Sigma e_1 e_2 : tau_2) tt"App"#
#frac(Delta;Gamma |--_Sigma e : forall alpha. tau_1 \qquad Delta |-- tau_2)(Delta;Gamma |--_Sigma e[tau_2] : tau_1[alpha|->tau_2]) tt"TyApp"#
#frac(Delta;Gamma, x:tau_1 |--_Sigma e : tau_2 \qquad Delta |-- tau_1)(Delta;Gamma |--_Sigma (lambda x:tau_1.e) : tau_1 -> tau_2) tt"Abs"#
#frac(Delta, alpha;Gamma |--_Sigma e : tau)(Delta;Gamma |--_Sigma (Lambda alpha.e) : forall alpha. tau) tt"TyAbs"#
Here’s the corresponding Agda code. Note, all Agda is doing for us here is making sure we haven’t written self-contradictory nonsense. In no way is Agda ensuring that this is the “right” definition. For example, it could be the case (but isn’t) that there are no values of this type. Agda would be perfectly content to let us define a type that had no values.
tySubst : TType -> TType -> TType
tySubst t1 t2 = tySubst' t1 t2 Z
where tySubst' : TType -> TType -> Nat -> TType
tySubst' (TTVar Z) t2 Z = t2
tySubst' (TTVar Z) t2 (S _) = TTVar Z
tySubst' (TTVar (S n)) t2 Z = TTVar (S n)
tySubst' (TTVar (S n)) t2 (S d) = tySubst' (TTVar n) t2 d
tySubst' (t1 => t2) t3 d = tySubst' t1 t3 d => tySubst' t2 t3 d
tySubst' (Forall t1) t2 d = tySubst' t1 t2 (S d)
data _isIn_at_ {A : Set} : A -> List A -> Nat -> Set where
Found : {a : A} -> {l : List A} -> a isIn (l , a) at Z
Next : {a : A} -> {b : A} -> {l : List A} -> {n : Nat} -> a isIn l at n -> a isIn (l , b) at (S n)
data _hasType_inContext_and_given_ : TExpr -> TType -> Context -> TypeContext -> Signature -> Set where
ConstJ : {t : TType} -> {c : Nat}
-> {Sigma : Signature} -> {Gamma : Context} -> {Delta : TypeContext}
-> t isIn Sigma at c
-> t isValidIn Delta
-> (TVal (TConst c)) hasType t inContext Gamma and Delta given Sigma
VarJ : {t : TType} -> {x : Nat}
-> {Sigma : Signature} -> {Gamma : Context} -> {Delta : TypeContext}
-> t isIn Gamma at x
-> t isValidIn Delta
-> (TVar x) hasType t inContext Gamma and Delta given Sigma
AppJ : {t1 : TType} -> {t2 : TType} -> {e1 : TExpr} -> {e2 : TExpr}
-> {Sigma : Signature} -> {Gamma : Context} -> {Delta : TypeContext}
-> e1 hasType (t1 => t2) inContext Gamma and Delta given Sigma
-> e2 hasType t1 inContext Gamma and Delta given Sigma
-> (TApp e1 e2) hasType t2 inContext Gamma and Delta given Sigma
TyAppJ : {t1 : TType} -> {t2 : TType} -> {e : TExpr}
-> {Sigma : Signature} -> {Gamma : Context} -> {Delta : TypeContext}
-> e hasType (Forall t1) inContext Gamma and Delta given Sigma
-> t2 isValidIn Delta
-> (TTyApp e t2) hasType (tySubst t1 t2) inContext Gamma and Delta given Sigma
AbsJ : {t1 : TType} -> {t2 : TType} -> {e : TExpr}
-> {Sigma : Signature} -> {Gamma : Context} -> {Delta : TypeContext}
-> e hasType t2 inContext (Gamma , t1) and Delta given Sigma
-> (TVal (TLambda t1 e)) hasType (t1 => t2) inContext Gamma and Delta given Sigma
TyAbsJ : {t : TType} -> {e : TExpr}
-> {Sigma : Signature} -> {Gamma : Context} -> {Delta : TypeContext}
-> e hasType t inContext Gamma and (S Delta) given Sigma
-> (TVal (TTyLambda e)) hasType (Forall t) inContext Gamma and Delta given Sigma
Here’s a typing derivation for the polymorphic constant function:
tyLam : TExpr -> TExpr
tyLam e = TVal (TTyLambda e)
lam : TType -> TExpr -> TExpr
lam t e = TVal (TLambda t e)
polyConst
: tyLam (tyLam (lam (TTVar Z) (lam (TTVar (S Z)) (TVar (S Z))))) -- Λs.Λt.λx:t.λy:s.x
hasType (Forall (Forall (TTVar Z => (TTVar (S Z) => TTVar Z)))) -- ∀s.∀t.t→s→t
inContext Nil and Z
given Nil
polyConst = TyAbsJ (TyAbsJ (AbsJ (AbsJ (VarJ (Next Found) (TyVarJ Z<Sn))))) -- written by Agda
data False : Set where
Not : Set -> Set
Not A = A -> False
wrongType
: Not (tyLam (lam (TTVar Z) (TVar Z)) -- Λt.λx:t.x
hasType (Forall (TTVar Z)) -- ∀t.t
inContext Nil and Z
given Nil)
wrongType (TyAbsJ ())
Having written all this, we have not defined a type checking algorithm (though Agda’s auto
tactic does a pretty good job); we’ve merely specified what evidence that a program is well-typed is. Explicitly, a type checking algorithm would be a function with the following type:
data Maybe (A : Set) : Set where
Nothing : Maybe A
Just : A -> Maybe A
typeCheck : (e : TExpr) -> (t : TType) -> (sig : Signature) -> Maybe (e hasType t inContext Nil and Z given sig)
typeCheck = ?
In fact, we’d want to additionally prove that this function never returns Nothing
if there does exist a typing derivation that would give e
the type t
in signature sig
. We could formalize this in Agda by instead giving typeCheck
the following type:
data Decidable (A : Set) : Set where
IsTrue : A -> Decidable A
IsFalse : Not A -> Decidable A
typeCheckDec : (e : TExpr) -> (t : TType) -> (sig : Signature) -> Decidable (e hasType t inContext Nil and Z given sig)
typeCheckDec = ?
This type says that either typeCheckDec
will return a typing derivation, or it will return a proof that there is no typing derivation. As the name Decidable
suggests, this may not always be possible. Which is to say, type checking may not always be decidable. Note, we can always check that a typing derivation is valid — we just need to verify that we applied the rules correctly — what we can’t necessarily do is find such a derivation given only the expression and the type or prove that no such derivation exists. Similar concerns apply to type inference which could have one of the following types:
record Σ (T : Set) (F : T -> Set) : Set where
field
fst : T
snd : F fst
inferType : (e : TExpr) -> (sig : Signature) -> Maybe (Σ TType (λ t → e hasType t inContext Nil and Z given sig))
inferType = ?
inferTypeDec : (e : TExpr) -> (sig : Signature) -> Decidable (Σ TType (λ t → e hasType t inContext Nil and Z given sig))
inferTypeDec = ?
where Σ indicates a dependent sum, i.e. a pair where the second component (here of type e hasType t inContext Nil and Z given sig
) depends on the first component (here of type TType
). With type inference we have the additional concern that there may be multiple possible types an expression could have, and we may want to ensure it returns the “most general” type in some sense. There may not always be a good sense of “most general” type and user-input is required to pick out of the possible types.
Sometimes the rules themselves can be viewed as the defining rules of a logic program and thus directly provide an algorithm. For example, if we eliminate the rules, types, and terms related to polymorphism, we’d get the simply typed lambda calculus. A Prolog program to do type checking can be written in a few lines with a one-to-one correspondence to the type checking rules (and, for simplicitly, also omitting the signature):
lookup(z, [T|_], T).
lookup(s(N), [_|Ctx],T) :- lookup(N, Ctx, T).
typeCheck(var(N), Ctx, T) :- lookup(N, Ctx, T).
typeCheck(app(F,X), Ctx, T2) :- typeCheck(F, Ctx, tarr(T1, T2)), typeCheck(X, Ctx, T1).
typeCheck(lam(B), Ctx, tarr(T1, T2)) :- typeCheck(B, [T1|Ctx], T2).
This Agda file contains all the code from this article.↩
Most dependently typed languages, such as Coq or Epigram would also be adequate.↩
See this StackExchange answer for more discussion of this.↩
Consider a 2x2 square centered at the origin. In each quadrant place circles as big as possible so that they fit in the square and don’t overlap. They’ll clearly have radius 1/2. See the image below. The question now is what’s the radius of the largest circle centered at the origin that doesn’t overlap the other circles.
It’s clear from symmetry that the inner circle is going to touch all the other circles at the same time, and it is clear that it is going to touch along the line from the origin to the center of one of the outer circles. So the radius of the inner circle, #r#, is just the distance from the origin to the center of one of the outer circles minus the radius of the outer circle, namely 1/2. As an equation:
#r = sqrt(1/2^2 + 1/2^2) - 1/2 = sqrt(2)/2 - 1/2 ~~ 0.207106781#
Now if we go to three dimensions we’ll have eight circles instead of four, but everything else is the same except the distances will now be #sqrt(1/2^2 + 1/2^2 + 1/2^2)#
. It’s clear that the only difference for varying dimensions is that in dimension #n# we’ll have #n# #1/2^2# terms under the square root sign. So the general solution is easily shown to be:
#r = sqrt(n)/2 - 1/2#
You should be weirded out now. If you aren’t, here’s a hint: what happens when #n = 10#? Here’s another hint: what happens as #n# approaches #oo#?
3blue1brown has a video describing this example and presenting one way of regaining intuition about it.
]]>