Scala-json-ast SP Proposal

Original SLIP Thread: https://github.com/scala/slip/pull/28
Gitter Channel: https://gitter.im/mdedetrich/scala-json-ast

This proposal is about adding a standard Scala JSON AST module to the SP; the goals of scala-json-ast are listed below. The current implementation can be viewed at https://github.com/mdedetrich/scala-json-ast (an easier-to-read version is available at https://github.com/mdedetrich/slip/blob/274307b68a13aa242ead7ee7fdb0d2fd3e6bc45f/text/json-ast.md)

Current State of the Ecosystem
Currently Scala is in an unfortunate position where it has roughly six competing JSON ASTs, and these ASTs usually differ in subtle ways. Because there is no established way to represent a JSON AST, libraries that need to work with JSON often have to write wrappers for every single JSON library out there. This becomes very problematic considering that each of these JSON libraries has its own lifecycle and often its own dependencies/obligations. The effects of this can be seen in libraries like slick-pg https://github.com/tminglei/slick-pg

The lack of a standard JSON AST creates problems similar to those in the Haskell ecosystem with regard to String (i.e. multiple competing String types which libraries/frameworks have to deal with). The idea is to introduce two JSON types, scala.json.ast.JValue and scala.json.ast.unsafe.JValue, which can be passed around by various frameworks and libraries.

Goals

  • Strictly ASTs only. This proposal does not include a JSON parser, nor guidelines on how to handle errors. That is left to web frameworks and JSON parsers as they see fit.
  • Provides two ASTs:
  • One is for idiomatic Scala use. It contains immutable data structures with predictable performance (including key lookup for JSON objects and index lookup for JSON arrays). It also uses the data structures recommended by the JSON standard https://tools.ietf.org/html/rfc7159, i.e. a JSON object is an unordered map with unique keys. It also mandates validation, so any value it holds is guaranteed to be a valid JSON value.
  • One is for atypical Scala use (called unsafe). It contains high-performance data structures (mutable and immutable) as well as the ability to handle specific cases (i.e. ordered keys in maps and multiple values per key, as needed in serialization scenarios).
  • Ability to seamlessly convert from the standard AST to the unsafe one and vice versa
  • Scala 2.10, 2.11 and 2.12 support as well as Scala.js support
  • For Scala.js, the unsafe AST uses the appropriate underlying JavaScript data structures to maintain performance (i.e. js.Array instead of Scala's Array)
  • Designed to be ultra stable and to never change unless critical issues are found
  • Strictly zero dependencies
  • Already part of the Scala Platform continuous integration as well as the sbt-platform-plugin
  • Comprehensive test suite for all platforms, including benchmarks
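
To make the two-AST design concrete, below is a minimal, self-contained sketch of the shapes described above. The constructor names and the conversion mirror the proposal's description, but they are not necessarily the library's exact API; treat this as illustrative only.

```scala
// Illustrative sketch of the two AST shapes described above; not the library's exact API.

// Standard AST: immutable, Map-backed objects, validated String-backed numbers.
sealed trait JValue
case object JNull extends JValue
final case class JString(value: String) extends JValue
final case class JBoolean(value: Boolean) extends JValue
final case class JNumber(value: String) extends JValue // assumed validated against the JSON number grammar
final case class JObject(value: Map[String, JValue] = Map.empty) extends JValue
final case class JArray(value: Vector[JValue] = Vector.empty) extends JValue

// Unsafe AST: array-backed, preserves key order and duplicate keys, performs no validation.
object unsafe {
  sealed trait JValue
  case object JNull extends JValue
  final case class JString(value: String) extends JValue
  final case class JBoolean(value: Boolean) extends JValue
  final case class JNumber(value: String) extends JValue // unvalidated; could even hold "NaN"
  final case class JField(field: String, value: JValue)
  final case class JObject(value: Array[JField] = Array.empty) extends JValue
  final case class JArray(value: Array[JValue] = Array.empty) extends JValue
}

// Conversion sketch: collapsing the unsafe form into the standard one drops
// key ordering, and the last duplicate key wins.
object Conversions {
  def toStandard(v: unsafe.JValue): JValue = v match {
    case unsafe.JNull           => JNull
    case unsafe.JString(s)      => JString(s)
    case unsafe.JBoolean(b)     => JBoolean(b)
    case unsafe.JNumber(n)      => JNumber(n)
    case unsafe.JObject(fields) => JObject(fields.map(f => f.field -> toStandard(f.value)).toMap)
    case unsafe.JArray(values)  => JArray(values.map(toStandard).toVector)
  }
}
```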

History and community adoption
The original idea of creating a common JSON AST dates back years, to when Json4s, Play and Spray attempted to collaborate on a standard JSON AST that they could share. Unfortunately the person pushing for the common AST no longer does Scala work, and the proposal lay dormant until it was revived. Initially the revival aimed to make the JSON AST part of the Json4s project; however, since the introduction of SLIPs it was decided that making the JSON AST part of the Scala Library (and now the Scala Platform as a module) was a better way to achieve its goals.

Numerous library and framework authors have signalled their interest in adopting the JSON AST in its current form; this includes:

Note that these are the most recent references; you can see older ones in the SLIP here https://github.com/mdedetrich/slip/blob/274307b68a13aa242ead7ee7fdb0d2fd3e6bc45f/text/json-ast.md#references---quotes-from-gitter

The development of the Scala JSON AST was also done in collaboration with the Scala community for over a year, and as far as I know, most major concerns with the actual design of the Scala JSON AST were (at worst) adequately resolved.


I don’t really like the name “unsafe”. As I understand the implementations, if you have multiple identical keys in a map, “unsafe” will safely preserve them, while the presumably safe alternative will not, despite that being permitted (though not recommended) JSON. I think a word more like “direct” would be better: it emphasizes that it more directly reflects what’s actually in the file and tries to impose minimal overhead. (I am not sure what the contrasting name would be; something like “tidy”?)

I do think the naming is important if this is to be widely adopted. People will often choose based on the name alone without much deeper understanding, so the name should be at least not misleading. If “unsafe” is actually unsafe in terms of throwing lots of exceptions, I would reconsider its design. If it just doesn’t perform every “I am the nicest kind of JSON” check for you, I would pick a name that didn’t make it sound so much like you might accidentally overwrite random memory locations.


Agree with @Ichoran's thoughts about the name unsafe; there is a high possibility that beginners will be confused by this name.

I am less concerned about the actual naming of the unsafe AST. It was originally meant to be called fast, but we had a big debate about the name around a year ago and then settled on unsafe. This is the reasoning why:

  • Some lookup operations on the AST are unsafe in regards to performance characteristics. unsafe.JObject has to iterate through the underlying array to look a key up, whereas the expected behaviour is effectively constant (eC) lookup time for Map-like objects (see the sketch after this list).
  • The underlying data structures for unsafe.JArray and unsafe.JObject are plain mutable arrays, which can lead to surprises if not used properly.
  • More generally, it allows behaviour which isn’t deemed “sane” when dealing with JSON (even though the spec may allow it). For example, you can have multiple values per key in an unsafe.JObject (since it is backed by an array of key/value pairs). Likewise, the keys of an unsafe.JObject are ordered. When typically dealing with JSON you should never assume that keys are ordered, and a JSON object should behave like a normal Map, i.e. one value per key.
  • Due to speed, it also doesn’t validate anything; for example, you can have NaN in an unsafe.JNumber (which isn’t allowed by the spec).
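
As a rough illustration of the lookup point above (a self-contained sketch with plain strings standing in for JSON values, not the library's actual code):

```scala
// Lookup cost sketch: Map-backed vs array-backed JSON objects.
object LookupSketch {
  // Standard AST style: effectively constant-time key lookup via the backing Map.
  def lookupStandard(obj: Map[String, String], key: String): Option[String] =
    obj.get(key)

  // Unsafe AST style: the backing array of (key, value) fields is scanned linearly,
  // and duplicate keys make the "right" answer ambiguous (here the first match wins).
  def lookupUnsafe(fields: Array[(String, String)], key: String): Option[String] =
    fields.collectFirst { case (k, v) if k == key => v }
}
```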

All of these things do imply the AST is “unsafe”. Regarding your point about scaring beginners away, I actually think that is part of the intent. Since we ended up with two ASTs in the first place (which was quite a heated argument at the time, and only decided after a lot of deliberation), we wanted to signal to beginners which AST they should use (the standard one, not the unsafe one). This was one of the main issues with safe vs fast (which is what the original design had): it was confusing to people as to “what should I use”. At least with this design there is a signal to new users about what they should use (and yes, they should almost always use the standard AST).

You could argue that an apt analogy is String vs Array[Byte] or Array[Char]. The former is the safe version, which you should almost always be using, whereas the latter is what you would typically use for low-level/performance-related work.

To conclude, I am not against renaming it to direct (or some other name); I am just listing the reasons why it is called unsafe right now.

I do think the naming is important if this is to be widely adopted. People will often choose based on the name alone without much deeper understanding, so the name should be at least not misleading. If “unsafe” is actually unsafe in terms of throwing lots of exceptions, I would reconsider its design.

Actually it was deliberately designed this way because it fills a lot of practical use cases which the safe version of the AST doesn’t (you can also blame this situation on the crappily defined JSON spec, but that is another argument). There are some rare but very important use cases which called for an “unsafe” design:

  • SBT 1.0 (which has server functionality) needs to very efficiently check whether two JSON ASTs are the same by generating a hash of each. If you lose ordering in JObject, this isn’t possible to do. They also need the speed.
  • If you are reading JSON from a source that is always going to give you valid JSON (i.e. a Postgres JSON column will always give you valid JSON; if it doesn’t, you have much bigger problems), and all you want to do is feed it on to another system in a high-performance situation, you generally don’t need all of this validation.
  • If you are writing a JSON parser (in fact the design of unsafe is somewhat inspired by jawn’s internal JSON AST).

Honestly, if you want to provide something practical that also adheres to the JSON spec in a sane way, you either need to provide a very complex/non-trivial implementation with its own custom data structures and number types to handle all of the possible corner cases (i.e. what Circe does), or you need one default implementation which is “sane” and which people generally should use, plus another which covers the remaining cases, which is what this implementation does.

That still sounds more like “unrefined” than “unsafe” to me.

Also, the JSON spec doesn’t say what you can do with a JNumber in the in-memory representation. It just won’t let you write out NaN. One can always fall back to null for infinite and not-a-number values.

And if SBT requires ordering of JSON objects, I’m rather terrified. This is no way to represent an AST. JSON has a perfectly usable construct for the case when order matters: an array. But the point that some program may require it is well-taken. SBT should be fixed, though. If order matters, use array. If you want to hash a map, use an order-independent hash.
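
For illustration, both hashing strategies mentioned here can be sketched with the standard library's MurmurHash3 (this is not SBT's actual code):

```scala
import scala.util.hashing.MurmurHash3

// Sketch of the two hashing strategies under discussion (illustrative only).
object HashSketch {
  // Order-sensitive: the same fields in a different order produce a different hash,
  // which is why an order-preserving (array-backed) object representation helps here.
  def orderSensitive(fields: Seq[(String, String)]): Int =
    MurmurHash3.orderedHash(fields)

  // Order-independent: insertion order does not affect the result, so a Map-backed
  // object can be hashed without preserving how keys appeared in the source.
  def orderIndependent(obj: Map[String, String]): Int =
    MurmurHash3.unorderedHash(obj)
}
```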

You can argue it’s unrefined because the spec is unrefined. Like I said, I am open to changing unsafe to something else, but given the peculiarities of the English language, unsafe has so far been the most accurate description of its usage and goals.

Sure, that was just a specific example: unsafe.JNumber allows you to write anything into the data structure, including null or NaN.

SBT actually needs it for caching and performance reasons. It doesn’t rely on JSON objects having ordering for any kind of business logic, but the feature is needed to cache JSON (and then to figure out whether the JSON has changed or not). Here is a direct quote from Eugene Yokota:

You can see the original issue here reboot - Map and key insertion order · Issue #8 · json4s/json4s-ast · GitHub

Hi Matthew,

I think the platform definitely deserves a better JSON library than the one that was in the standard library. Your library is interesting and mostly great; however, I have a few criticisms of your proposal:

  • I don’t think the platform should have both the safe and unsafe ASTs. The goal of the platform is to provide libraries supporting the main use cases. The description you gave of the unsafe AST makes me think of advanced use cases.
  • I am not sure that using a String to represent numbers is the best approach. When pattern matching on a JNumber I would expect to get a numeric value. I don’t think the converters you provide are a good idea because they can throw exceptions. Maybe we should draw inspiration from circe’s BiggerDecimal, or at least provide an extractor that returns a BigDecimal;
  • I think we not only need an AST but also a JSON parser and a printer;
  • For simplicity we should probably share the same safe AST source code between Scala.js and the JVM.

Yup, although there are plenty of examples of these use cases in other standard libraries; i.e. you can argue that dealing with Array[Byte] or Array[Char] isn’t something that people should typically have to deal with, yet this functionality still exists in the stdlib of almost every major language (and this isn’t even the stdlib, this is just the Scala Platform).

I personally think there is a benefit in having the unsafe AST as part of this package; libraries such as SBT have demonstrated there is a use for it.

[quote=“julienrf, post:7, topic:175”]
I am not sure that using a String to represent numbers is the best approach. When pattern matching on a JNumber I would expect to get a numeric value. I don’t think the converters you provide are a good idea because they can throw exceptions. Maybe we should draw inspiration from circe’s BiggerDecimal, or at least provide an extractor that returns a BigDecimal;
[/quote]

I am not against using a special number type (such as Circe’s BiggerDecimal); however, I wanted to keep the implementation as lean and as trivial as possible and use a validated String. Maintaining a custom number implementation isn’t easy (for similar reasons there is also a separate implementation called unsafe which handles multiple values per key plus ordering for JSON objects; I don’t want to maintain a new custom data structure that handles all of the corner cases of JSON objects).

Also note that the AST is meant to be abstracted over by other libraries themselves, so a library using Spire (for example) would use its own real number type rather than a BiggerDecimal. I would be more open to using a custom number type if it were part of the stdlib, but unfortunately BigDecimal has known issues, and I don’t think that maintaining a custom number implementation fits the goals of the AST (although I am open to changing my mind on this).

Regarding the number converters, I am actually on the fence about their design myself. I have asked numerous times for feedback on the converters and you’re actually the first person to address this directly. The main emphasis of the current design of the number converters is easy extensibility (i.e. for other custom number types).

Also, regarding the note about throwing exceptions, I think this is more a matter of style. Converting to a specific number type should not fail at all (at least for the standard AST) considering that all numbers are validated by a regex. You may lose precision when converting to something like a Double, but I believe this is to be expected since we are dealing with numbers in the first place.
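
For concreteness, here is a hedged sketch of what a String-backed number with non-throwing converters can look like (the name JNumberSketch and these method names are made up; the library's actual converter API may differ):

```scala
import scala.util.Try

// Sketch of a String-backed JSON number with total (non-throwing) converters.
final case class JNumberSketch(value: String) {
  // Exact conversions: Some only when the value fits the target type exactly.
  def toIntOption: Option[Int]               = Try(BigDecimal(value).toIntExact).toOption
  def toLongOption: Option[Long]             = Try(BigDecimal(value).toLongExact).toOption
  def toBigDecimalOption: Option[BigDecimal] = Try(BigDecimal(value)).toOption

  // Lossy but total for a validated JSON number: precision may be lost, as noted above.
  def toDoubleLossy: Double = value.toDouble
}
```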

Agreed, but one step at a time. JSON parsers are a lot harder than just the ASTs, and they tend to have tradeoffs that are far less negligible than those of the AST itself. It’s going to be easier to write a parser once we have settled on the AST.

If possible, sure. The separations that currently exist are mainly for unsafe (unsafe uses js.Array instead of Array; this is deliberate, both for performance and so it doesn’t pull in Scala’s stdlib). There are also custom implementations of things like equals and hashCode for the different platforms, as well as performance/platform-related functions (i.e. toJsAny isn’t something that makes sense on the JVM).

I don’t think this is a big concern; the implementation of the Scala JSON AST is trivial (and this shouldn’t really change), and the differences are there for quality-of-life reasons.

What does that mean? Can you elaborate?

It isn’t expected that libraries will expose the JSON AST type directly for users to work with (although that can definitely happen in some cases). An example would be the pull request I made against play-json which replaces their internal AST with this one (you can see it here: GitHub - mdedetrich/playframework at scalaJsonAstIntegration). In this case, Play’s exposed public API for working with its JSON AST stays exactly the same; the difference is that it now works on an external AST (scala-json-ast). Implicit classes as well as type aliasing were used to achieve this.
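
As a sketch of the integration pattern described here (the object name MyJsonLibrary and the describe method are hypothetical; this is not play-json's actual code):

```scala
// Hypothetical sketch: wrapping the shared AST behind a library's existing API
// via a type alias plus implicit enrichment.
object MyJsonLibrary {
  // The library's public name for JSON values now points at the shared AST type.
  type JsValue = scala.json.ast.JValue

  // Library-specific convenience methods are layered on with an implicit class,
  // so user code keeps its familiar surface while values stay interoperable.
  implicit class JsValueOps(private val value: JsValue) extends AnyVal {
    def describe: String = value.toString
  }
}
```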

Of course, this all depends on how the other JSON libraries/frameworks integrate with scala-json-ast. The goal we are trying to achieve is that they all work on the same JValue type, so that libraries can freely pass JValue values around without having to worry about interoperability issues.

I don’t think we want libraries to hide this “external” AST as you did (I’m not in favor of using implicit class to enrich the AST).

I understand the argument of getting better interoperability with a common AST, though.

I wouldn’t say “hide” is the right word (it’s the same type after all), just that it’s expected that some libraries will do this.

The point I was trying to make is that the AST is deliberately designed to be “light” so that it’s easier to integrate with current JSON libraries/web frameworks.

That’s one of the primary motivations. At work we have to deal with three JSON types (json4s, play-json and circe-json), and we have actually been stuck on specific versions of libraries due to binary compatibility issues (since all of these JSON ASTs/libraries have their own lifecycles).

This library does not have widespread use, as is required for platform modules as I understand things. Accepting the argument that it would gain widespread use if only it were part of the platform does not seem like a good precedent. If the library cannot gain use by its own merit then I don’t think it should be considered for the platform.

tpolecat, why wouldn’t it be good enough that big players are lined up who say they will use the new library once it becomes official? (and not begrudgingly either, they like it). That’s precisely the situation here.

This library wasn’t born yesterday. I don’t think it’s accurate or fair to talk about it as if it were some brand-new, untried thing that someone just randomly dreamed up. The community has been iterating and incubating on this for years, and what Matthew has done is a relatively minor variation on existing code that’s been around for years and in widespread use for years.

Of course something totally new and untried wouldn’t just be ushered into the platform right away.

How many library maintainers have endorsed this proposal here? I don’t see any.

Matthew’s original post all the way at the top of this thread already contains detailed substantiation of my statement, complete with links/references. I don’t think it’s necessary to make all the library authors come here and restate their support yet again.


I agree with Seth, that argument would only again derail the discussion once more.

We (meaning: akka team, spray team, play team, lagom team) would all want to support a common AST like this and have voiced this multiple times. Please note that the feature here is interop: we could provide tools in our libs that any other lib able to emit this AST would work with. This would make it easier to use any other library, say Argonaut, if someone would like to, while still using the tools (say, testing support) provided in the core of these projects (e.g. akka-http could add testing support and would only require a conversion into the shared AST, which is much simpler than today).

The problem is not the “adopting”; that’s simple. The timing of that move, though, is the hard part, and with this proposal slipping for so long it has already missed a few trains for us. Since we’re bound by binary compatibility, it would not help anyone if we adopt it, get it out there, and then tell all our users to completely change their libs again because the package name changed (which it might when going into the platform).

This is the biggest complaint I have heard so far from library authors. The AST proposal has existed for years, it has been delayed for various reasons, and the biggest frustration has been the constant delays and bikeshedding.

The main goals of the library are to provide interop and binary stability. For other libraries to actually get these benefits, the library has to be released properly in the platform. A pre-release version of the library already exists, but library authors are not going to put that into their stable releases, because doing so would break their stability/compatibility goals when they have to change the dependency again once it is actually released in the platform.

I would also like to have some more technical discussion about the library rather than these arguments that derail everything (once again). @julienrf earlier posted some great feedback about the library and I would like more feedback of this sort. So far the library authors have more or less been happy with the current design (it’s not entirely possible to please everyone completely), but I suspect that with all the delays that have been happening, a lot of enthusiasm has been lost along the way.

Does this imply that for a library to become part of the platform, it requires a separate platform release?

In general I can’t say, but for this particular case there isn’t a separate release. Rather, the current release is the draft proposal; assuming the SLIP is approved, it may just be a matter of removing the milestone from the version and releasing that way, or it could be more complex than that (I am not sure what the publishing requirements for the SP are).

At least personally, I don’t want to maintain a separate release for the SP; that is against the design goals of the SLIP in the first place.