marcus_holmes 3 months ago

I tried a similar approach in a previous startup - treat data as documents and store in a JSONB field.

Postgres was awesome and handled this brilliantly, but the lack of schema and typing killed it. We just ended up fighting data quality the whole time. We couldn't assume that any document had all the required fields, or that they were in a format that made sense, e.g. the Price column sometimes had currency symbols, sometimes commas and periods in UK/US format, and sometimes in Euro format - sorting by Price involved some complicated parsing of all the records first.

We moved back to relational tables.

I won't say I'd never do this again, but I would definitely not just throw JSON documents to a database and expect good things to happen.

  • aden1ne 3 months ago

    Had a similar experience. What killed it for me is that no statistics can be gathered for JSONB columns. This in turn really messes with the query planner once you do something like `select a.* from a join b on a.id = b.a_id where b.my_jsonb_column ->> 'foo' = 'bar';`.

    Given the lack of statistics, the query planner loves going for a nested loop rather than a hash or merge join where those would be appropriate, leading to abysmal performance.
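
    One partial workaround (just a sketch on my part, with made-up names): an index on the extracted expression, since ANALYZE collects statistics for indexed expressions and the planner can use them when a query uses that exact expression:

        CREATE INDEX b_foo_idx ON b ((my_jsonb_column ->> 'foo'));
        ANALYZE b;  -- pg_stats now has frequencies for the indexed expression

        -- the join can now get a sane row estimate and pick a hash/merge join
        SELECT a.* FROM a JOIN b ON a.id = b.a_id
        WHERE b.my_jsonb_column ->> 'foo' = 'bar';

    It only helps predicates on that exact expression, but that's often enough to get the join strategy right.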

    There is a thread[0] on the PostgreSQL mailing list to add at least some statistics for JSONB columns, but it has gone nowhere since 2022.

    [0]: https://www.postgresql.org/message-id/flat/c9c4bd20-996c-100...

    • jeffffff 3 months ago

      it's kind of dumb that postgres uses a nested loop join instead of a hash join there. hash join almost always has the best worst-case behavior, and without stats it should be the default choice.

  • winternewt 3 months ago

    If you have schema requirements, why _not_ do it with tables and columns instead? The point of the jsonb column would be precisely to store arbitrary stuff that has no schema.

    • bruce343434 3 months ago

      I usually see this (dynamic scripting langs, schemaless databases) play out as a very good environment to prototype in, very few barriers to change things or make stuff up as you go. Then the prototype is not discarded but "iterated" to form the actual production code. The (now needlessly so) flexible JSON store is grand-fathered in.

      I have yet to come across an actual use case for JSON stores in a production app with an established design. What the hell do you mean you have no idea what the data might hold?? Why are you storing unknown, undefined, undefinable things??? Or perhaps, there actually is a schema i.e. fields we rely on being present, but we were too lazy to define it formally?

      • eropple 3 months ago

        Well, there are a decent number of those use cases, but the one I find most compelling is storing an exact record of what another service sent me in case I need it later. I pull out and store in my typed schema the relevant bits of, say, an OpenID Connect response, but I'll often store the entire response as well in case I want to add functionality in the future based on additional fields, etc. in that response and roll forward without having to have an intermediate state.

      • tracker1 3 months ago

        Adjacent data that is very alike for most usage but different in subtle ways.

        Classified ads: the additional details for a Car are different than for a Shirt, but both would be ads. And adding a nearly infinite number of fields, or a flexible system built on a set of schema or detail tables, is significantly worse than unstructured JSON.

        Another would be records from different, but related systems. Such as the transaction details for an order payment. Paypal data will be different from your CC processor, or debit transaction, but you can just store it as additional details for a given "payment" record.

        Another still would be in healthcare, or any number of other systems where the structures will be different from one system to another depending on data interchange, where in the storage you don't care too much, only in the application layer will it make any necessary difference.
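
        For the Car/Shirt example above, roughly (a sketch, made-up names): shared ad fields as columns, category-specific attributes as JSONB:

            CREATE TABLE ads (
                id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
                category    text    NOT NULL,              -- 'car', 'shirt', ...
                title       text    NOT NULL,
                price_cents integer NOT NULL,
                attributes  jsonb   NOT NULL DEFAULT '{}'  -- mileage for cars, size/colour for shirts
            );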

      • mexicocitinluez 3 months ago

        > I have yet to come across an actual use case for JSON stores in a production app with an established design

        Healthcare. Clinical data is unstructured and constantly changing. Would you build out a table with 2000 columns that changed yearly? What about 5000?

      • goosejuice 3 months ago

        Data owned by another service/application/org that is written and read without modification.

        For example, a healthcare document that wasn't built or technically owned by the application storing it.

        For example, a web text editor that serializes its state as JSON.

        Not json, but a web scraper storing html documents.

        These have structure, it's only that the structure is built/maintained outside of the application storing it. You could of course transform it but I think it's a bit obvious where that might not be worth the cost/complexity.

      • cryptonector 3 months ago

        You can always change the source of truth so instead of being the JSON texts it's the properly normalized schema and then you can have a VIEW that returns JSON for apps that need JSON. And if need be you can denormalize a bit and have a JSONB column that stores the document as the serialization of the rows' other columns.

        That's what's really nice about SQL: you can perform these sorts of schema surgeries and still retain backwards-compatibility using VIEWs.
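
        A minimal sketch of what I mean (names made up): normalized columns as the source of truth, with a VIEW serializing them back to JSON for the apps that still want documents:

            CREATE TABLE customers (
                id    uuid PRIMARY KEY,
                name  text NOT NULL,
                email text NOT NULL
            );

            -- backwards-compatible "document" view built from the normalized rows
            CREATE VIEW customer_docs AS
            SELECT id AS _id,
                   jsonb_build_object('name', name, 'email', email) AS data
            FROM customers;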

      • cruffle_duffle 3 months ago

        A use case for me is I’m calling a lot of different 3rd party APIs that return similar things but in widely different data structures with a lot of information specific to that site. I take what is standard across all the API’s and store that in my relational database but anything I didn’t use I store in a JSONB field just in case I need it some time in the future.

        In other words I’m not really using anything in that JSONB field… at least right now.

      • SSLy 3 months ago

        We're evaluating solutions to ingest business app events and infrastructure logs. Document-ish JSON-ish "DB"/search engine seems like best solution, because no one does structured logs properly.

        Now we have to pick the stack. ELK, Loki + Graphana, Graylog or maybe just dump into MongoDB?

      • okr 3 months ago

        Customers sent custom things.

    • marcus_holmes 3 months ago

      I think my learning was "all data has schema, eventually"

      We thought it was just a bunch of data documents. But it turned out that to actually use that data in an application it had to have predictability, we had to know certain fields were present and in a fixed format. You know: a schema.

      • mexicocitinluez 3 months ago

        I'm confused about this. Since when can't you enforce a schema on a document you're storing? Did you have no way to change the data before it hit the database?

        NoSql stores still have schemas.

  • jimmyl02 3 months ago

    the view I now have is that for a relational table, yes you have to suffer through migrations but at least they are in sql. for document based stores, you still have to have migrations, but they are just implemented in code

    json documents sound great, especially initially, but end up being a maintenance nightmare

    • hans_castorp 3 months ago

      > for document based stores, you still have to have migrations, but they are just implemented in code

      The problem with that - in my experience - is that migrating the structure for thousands (if not millions) of documents is way slower than running a DDL command (as it means reading each document, parsing it, modifying it and writing it back). Many DDL commands are just metadata update to the system catalogs so they are quite fast (e.g. adding a new column with a default value). With documents you wind up with millions of single row updates.

      This can be mitigated by doing a "lazy" migration when a document with the old structure is first read. But that makes the code much more complicated.
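
      To make the difference concrete (a rough example with made-up table names):

          -- relational: a catalog-only change since PG 11, effectively instant
          ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';

          -- documents in a JSONB column: reads and rewrites every affected row
          UPDATE order_docs
          SET data = data || '{"status": "new"}'
          WHERE NOT data ? 'status';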

    • globular-toast 3 months ago

      Or, to put it another way, yes you have to write and maintain an upfront schema, but a document-based system has a schema too, it's just distributed amongst 50 codebases and 10 people's heads.

  • raverbashing 3 months ago

    > the Price column sometimes had currency symbols, and sometimes commas-and-periods in UK/US format and sometimes in Euro format

    Well this is not a relational issue is it? It is a data normalization issue

    • marcus_holmes 3 months ago

      Yes, true. But that was a single example off the top of my head. There were lots of others.

      Not saying the approach isn't suitable for some use cases. Just that I'd be really careful that this is one of those use cases next time.

  • jamil7 3 months ago

    I don’t do much backend work at the moment but I found using JSONB fields for prototyping endpoints to be quite handy. Once I had a feature figured out and working, I’d create a proper schema.

    • marcus_holmes 3 months ago

      I've found Postgres' Hstore type useful for this, too.

      I often create a "metadata" hstore field on a table and use it for random bits of data I don't want to create an actual field for yet. When I find that the application needs that bit of data and in a certain format, I'll move it into an actual field.
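
      Roughly this pattern (a sketch, made-up names):

          CREATE EXTENSION IF NOT EXISTS hstore;

          ALTER TABLE projects ADD COLUMN metadata hstore NOT NULL DEFAULT '';

          -- stash an ad-hoc value
          UPDATE projects SET metadata = metadata || hstore('beta_flag', 'true') WHERE id = 1;

          -- later, promote it to a real column once the shape settles
          ALTER TABLE projects ADD COLUMN beta_flag boolean;
          UPDATE projects SET beta_flag = (metadata -> 'beta_flag')::boolean;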

      • cryptonector 3 months ago

        hstore is basically the flat precursor to JSONB, and IMO it's as good as obsolete. I wouldn't use hstore in a new project, just JSONB.

        • marcus_holmes 3 months ago

          I get that, but I think the temptation to add too much structure to the JSONB would be too great; this is just a temporary or prototype field, so it just needs a name and a value. Hstore works great for this.

          • cryptonector 3 months ago

            And I get that, but also hstore is ugly. Just be disciplined :)

  • tracker1 3 months ago

    This is an issue regardless of your type of document storage. There are different options for verifying document details and versioning when working with document-oriented databases. A typical approach is to have a per-record version number and pass documents through a verifier as part of insert, update or query to ensure all migrations to the current version have occurred.
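
    In Postgres JSONB terms that could look something like this (just a sketch of the pattern, made-up names):

        -- keep a schema version alongside each document
        ALTER TABLE documents ADD COLUMN IF NOT EXISTS schema_version int NOT NULL DEFAULT 1;

        -- one-off catch-up migration, v1 -> v2 (a field rename, say)
        UPDATE documents
        SET data = (data - 'fullName') || jsonb_build_object('full_name', data -> 'fullName'),
            schema_version = 2
        WHERE schema_version = 1;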

  • goosejuice 3 months ago

    Sounds like most/all of these issues would be solved by validation at the app layer.

    • ivandenysov 3 months ago

      True, but for example with db-level validations I don’t have to worry that a new developer ships a batch job that runs raw inserts for performance reasons and bypasses app layer validations.

      • goosejuice 3 months ago

        I'm not advocating for no database level validations. One can and should have both in my opinion.

        Postgres supports constraints on jsonb as well.
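
        For example, a CHECK constraint on the document shape (made-up names):

            ALTER TABLE payments
            ADD CONSTRAINT payment_doc_has_amount CHECK (
                data ? 'amount'
                AND jsonb_typeof(data -> 'amount') = 'number'
            );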

        I would generally advocate for no write access outside of the app as well. Certainly for new developers. Get some tests on that batch job.

        • corytheboyd 3 months ago

          > I would generally advocate for no write access outside of the app as well.

          FWIW I think OP was referring to app code still, but code opting in to skipping app data validations. Rails for example makes this very easy to do, and there are tons and tons of people out there selling this as a performance improvement (this one trick speeds up bulk operations!!!!). There are times where it’s useful, but it’s a very sharp knife that people can easily misuse.

          So yeah, anyway, have both db constraints and app validations.

          • goosejuice 3 months ago

            Thanks for pointing that out, I did misread that and have seen/written such things in rails land. We used active record import which gives the option of running validations.

            But yeah, layering your safety nets is generally wise. I would include testing and code review in that as well.

      • zepolen 3 months ago

        What? You should be exposing your data access layer as a microservice and all your db level validations should be done there...

        You can even write a small DSL syntax language to make it easier to use for developers, and perhaps an Obviously Redundant Middleware that sits between them to convert their programming language objects to the DSL. Add some batch support, perhaps transactional locks (using mongo, we want to be webscale after all) and perhaps a small terminal based client and voila, no one should ever need to deal with petty integrity in the db again.

        • eropple 3 months ago

          This post is a fantastic example of Poe's Law. You had me for a second.

    • pennomi 3 months ago

      App layer validation is for the user’s sanity. DB layer validation is for the developer’s sanity.

  • cruffle_duffle 3 months ago

    You know I’ve found that LLM’s are awfully good at taking random JSON crap and “converting it” to a consistent format. You don’t even need a large LLM… something like ChatGPT 3.5 or an equivalent class of LLM can do the work just fine.

    It’s not perfect and requires a fair amount of effort to nail the prompt but when it works it works.

zulban 3 months ago

Neat. When I migrated a project from mongo to postgres I took a similar approach, except I only implemented the mongo feel I needed within my own project instead of building a proper library as done here. I was surprised how much performance improved despite using a hacky wrapper.

https://blog.stuartspence.ca/2023-05-goodbye-mongo.html

Personally tho, I plan to just drop all similarity to mongo in future projects.

  • throwaway2037 3 months ago

    Cripes, this is a good blog post. Did it get posted to HN for discussion? If not, it should have been. I especially like this part:

        > I'm trying to stay humble. Mongo must be an incredibly big project with lots of nuance. I'm just a solo developer and absolutely not a database engineer. I've also never had the opportunity to work closely with a good database engineer. However I shouldn't be seeing improvements like this with default out of the box PostgreSQL compared to all the things I tried over the years to fix and tune Mongo.
    
    Humbleness: That is rare to see around here. It is impressive that you got such speed-ups for your use case. Congrats, and thank you for sharing the blog post.

    EDIT

    The Morgan Freeman meme at the end gave me a real laugh. I would say the same about my experience with GridGain ("GridPain").

    • dayjaby 3 months ago

      > I knew it was time to migrate when my non-programmer wife was becoming familiar with Mongo shortcomings

      That part was so relatable

  • oskar_dudycz 3 months ago

    Yup, I might not reach full compliance, but I will try to follow the Pareto principle. Thanks for the link and kind feedback!

  • jrochkind1 3 months ago

    > I was surprised how much performance improved

    Are you saying you got better performance from postgres jsonb than from mongodb itself?

    • zo1 3 months ago

      From the article, they mention this alongside a neat graph: "API calls generally take 8 ms now, not 150 ms."

      His endpoints went from 150ms (with Mongo) to 8ms after moving to Postgres.

      • jrochkind1 3 months ago

        Thanks. How embarrassing for mongo, woah.

  • Thaxll 3 months ago

    I remember your post back then and it did not make sense at all; many pointed out it was lacking information and that you probably did something wrong with mongo.

    All the stuff under Mongo Problems is garbage, sorry.

harel 3 months ago

I regularly find the hybrid model is a sweet spot. I keep core fields as regular columns and dynamic data structures as JSONB. It brings the best of both worlds together.

  • Waterluvian 3 months ago

    I do this too with Postgres and it is just the best of both.

    A robot is a record. A sensor calibration is a record. A warehouse robot map with tens of thousands of geojson objects is a single record.

    If I made every map entity its own record, my database would be 10_000x more records and I’d get no value out of it. We’re not doing spatial relational queries.

    • hobs 3 months ago

      It's great when you have no reason EVER to decompose the data.

      That being said, when you start going "wait why is one record like this? oh no we have a bug and have to fix one of the records that looks like this across all data" and now you get to update 10,000x the data to make one change.

      • harel 3 months ago

        Small price to pay in my opinion. How often will that happen vs how often the database is used. Migrations like that can be done incrementally over time. It's a solved problem.

        • Waterluvian 3 months ago

          It’s also trivial to do. My JSON fields are all backed by JSON Schema. And I just write a data migration that mutates the data in some way and have the migration run by one host in a rate limited manner. It’s not quite as good as a traditional change in schema but it’s such a non-issue.

          • hobs 3 months ago

            I am glad it works! I have just been subject to several systems that have grown over time that worked very well until it became a problem (and then a huge one) so I am glad you are taking a disciplined approach.

            • Waterluvian 3 months ago

              Yup you’re absolutely right. There is no one size fits all. The more you can plan, and the more clear your case is, the less you need to pay for having flexibility.

  • webprofusion 3 months ago

    Yeah we do this as well. Have previously been a heavy MongoDB user but when we migrated to Postgres (using some relational schema with some JSONB for nested objects etc) it just made querying so much easier and reliability has been great.

willsmith72 3 months ago

It's technologically cool, but I would love a "why" section in the README. Is the idea you're a mongo Dev/love the mongo api and want to use it rather than switch to pg apis? Or want to copy some code over from an old project?

I'm sure there are use cases, I'm just struggling to grasp them. Especially if it's about reusing queries from other projects, AI is pretty good at that

  • megadal 3 months ago

    > AI is pretty good at that

    Good at what? Rewriting the queries?

    I think the point of Pongo is you can use the exact same queries for the most part and just change backends.

    I've worked a job in the past where this would have been useful (they chose Mongo and regretted it).

    • willsmith72 3 months ago

      For sure, but it feels really risky. If it's a small codebase, I would be more confident knowing what queries I was using and just switch them. If it's a large codebase, I'd want some really comprehensive test coverage, including performance tests

ilius2 3 months ago

If I were to start a new project, I would directly use postgres, and possibly add a JSONB column ONLY FOR OPTIONAL fields that you don't query frequently. Throwing everything in a document is just fermenting chaos and pain. That being said, I do love the syntax and structure of Mongo pipelines over SQL.

  • throwaway76324 3 months ago

    At $WORK, we use the same approach for integrations with 3rd party systems.

    The data that is common to all integrations is stored as columns in a relational table. Data that is specific to each integration is stored in JSONB; this is typically metadata used to manage each integration, which varies.

    It works great and you get the combination of relational safety and no-schema flexibility where it matters.

  • stanislavb 3 months ago

    "Throwing everything in a document is just fermenting chaos and pain." - I LOVE THIS.

Squarex 3 months ago

How does it compare with FerretDB[0]?

[0] https://www.ferretdb.com/

  • aleksi 3 months ago

    (I'm FerretDB co-founder)

    As far as I can tell, Pongo provides an API similar to the MongoDB driver for Node that uses PostgreSQL under the hood. FerretDB operates on a different layer – it implements MongoDB network protocol, allowing it to work with any drivers and applications that use MongoDB without modifications.

    • Sytten 3 months ago

      I dont want to sound rude, but as a bootstrap founder it kinda boggles my mind how much money people can raise for a product like ferretdb. I just don't see how it can make VC level return without at the very least changing licenses, which seems to be the premise behind creating this MongoDB proxy. I am sure there is a narrative for it though so best of luck!

      Also check your managed service links on GitHub, half are dead.

      • aleksi 3 months ago

        If bait-and-switch were our strategy, we would have chosen a different license from the beginning. The Apache license allows everyone to fork FerretDB away and do whatever they like with it. It is unlike MongoDB with their initial AGPL that, in theory, allows everyone to, say, run MongoDB SaaS, but in practice, has enough strings attached to scare people off.

        We want to have a piece of a bigger pie, not a bigger piece of an existing pie. Providing alternatives makes the whole market bigger.

        > Also check you managed service links on GitHub, half are dead.

        Thank you.

  • Zambyte 3 months ago

    The posted project looks like a client that connects to pg but behaves like Mongo, where Ferret is a server that accepts Mongo client connections and uses pg as backend storage.

  • oskar_dudycz 3 months ago

    Yes, I'm using MongoDB API in Pongo to keep the muscle memory. So, it's a library that translates the MongoDB syntax to native PostgreSQL JSONB queries.

    • remram 3 months ago

      "how does it compare?" "yes"

      Are you sure you replied to the right comment?

      • oskar_dudycz 3 months ago

        Yes to "The posted project looks like a client that connects to pg but behaves like Mongo, where Ferret is a server that accepts Mongo client connections and uses pg as backend storage."

        • remram 3 months ago

          Yeah, FYI that's a sibling comment, not the one you replied to.

hdhshdhshdjd 3 months ago

I use JSONB columns a lot, it has its place. It can fit certain applications, but it does introduce a lot of extra query complexity and you lose out on some ways to speed up query performance that you could get from a relational approach.

Which is to say JSONB is useful, but I wouldn’t throw the relational baby out with the bath water.

  • noisy_boy 3 months ago

    I think if a field in a JSONB column is used frequently in the most common query patterns, it might be time to "surface" it up among other non-JSONB relational columns. There will be some additional overhead e.g. it should be made a read-only column for consistency's sake; if you update the value of this field in the JSONB column, this surfaced column should also be updated.

    However, a bit of duplication is not a terrible trade-off for significantly improved query performance.
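
    One way to get that surfaced, read-only copy without keeping it in sync by hand is a stored generated column (PG 12+; rough sketch, made-up names):

        ALTER TABLE events
        ADD COLUMN customer_id uuid GENERATED ALWAYS AS ((data ->> 'customerId')::uuid) STORED;

        CREATE INDEX events_customer_id_idx ON events (customer_id);

    The generated column can't be written directly and is recomputed whenever the row is updated, which removes the manual-sync worry.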

  • doctor_eval 3 months ago

    I’ve been doing some reasonably serious playing with the idea of using jsonb columns as a kind of front end to relational tables. So basically, external interactions with the database are done using JSON, which gives end users some flexibility, but internally we effectively create a realtime materialised view of just those properties we need from the json.

    Anyone else tried this approach? Anything I should know about it?

    • Zenzero 3 months ago

      My mental model has always been to only use JSONB for column types where the relations within the json object are of no importance to the DB. An example might be text editor markup. I imagine if you start wanting to query within the json object you should consider a more relational model.

      • doctor_eval 3 months ago

        My experience has been that users like to include a lot of information that's not relevant to the application I'm working on.

        My application couldn't really care less about customer names, for example. But the people who buy my software naturally do care - and what's worse, each of my potential customers has some legacy system which stores their customer names in a different way. So one problem I want to address is, how do I maintain fidelity with the old system, for example during the data migration, while enabling me to move forward quickly?

        My solution has been to keep non-functional data such as customer names in JSON, and extract only the two or three fields that are relevant to my application, and put them into a regular SQL database table.

        So far this has given me the best of both worlds: a simple and highly customisable JSON API for these user-facing objects with mutable shapes, but a compact SQL backend for the actual work.

      • mbesto 3 months ago

        I always thought JSONB is a great use case for client-specific customizations. For example - a CRM application where you have structured DB tables like "account", "opportunity", etc. but then a client-specific table might be "cohort" that is linked to "account" and has fields "cohort size", "cohort name", etc. You just chuck the cohort table in JSONB and now the company who uses the CRM that has the custom cohort table can change that table without creating/migrating your RDMS table structure.

        Curious what other people think here?

        • Zenzero 3 months ago

          That sounds like something that would be better tailored to well normalized tables and custom row entries. The question I'd ask is if something you're concerned about being a column could instead be a row in a different table.

    • hdhshdhshdjd 3 months ago

      I do something similar, building a lightweight search index over very large relational datasets.

      So the tables are much simpler to manage, much more portable, so I can serve search off scalable hardware without disturbing the underlying source of truth.

      The downside is queries are more complex and slower.

  • martinald 3 months ago

    I've had extremely disappointing performance with postgres JSONB columns. I've used it a lot for MVPs and betas where it's much easier to use JSONB and lock in your exact relational schema later.

    I've now decided this path is a mistake because the performance is so bad. Even with low thousands/tens of thousands of rows it becomes a huge problem, queries that would take <1ms on relational stuff quickly start taking hundreds of ms.

    Optimizing these with hand-rolled queries is painful (I do not like the syntax it uses for jsonb querying), for some queries it doesn't really fix much, and indexes often don't help.

    It seems that jsonb is just many, many orders of magnitude slower, but I could be doing something wrong. Take for example storing a dictionary of string and int (number?) in a jsonb column. Adding the ints up in jsonb takes thousands of times longer rather than having these as string and int in a standard table.

    Perhaps I am doing something wrong; and I'd love to know it if I am!

    • ak217 3 months ago

      Did you have arrays in your jsonb data? I don't currently use jsonb for production performance sensitive queries, but in the past what I learned was that it's great at automatically indexing everything except arrays. You have to manually declare indexes on your arrays.

      When jsonb works, it's incredible. I've had many... suboptimal experiences with mongo, and jsonb is just superior in my experience (although like I said, I haven't used it for performance critical stuff in production). For a long time, it kinda flew under the radar, and still remains an underappreciated feature of Postgres.

    • phamilton 3 months ago

      > indexes often don't help.

      An "index on expression" should perform the same regardless of the input column types. All that matters is the output of the expression. Were you just indexing the whole jsonb column or were you indexing a specific expression?

      For example, an index on `foo(user_id)` vs `foo(data->'user_id')` should perform the same.
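
      In plain terms, an expression index over the extracted value (made-up names):

          -- btree index on the extracted value; the planner treats it like any scalar index
          CREATE INDEX orders_user_id_idx ON orders ((data ->> 'user_id'));

          -- this predicate can now use the index
          SELECT * FROM orders WHERE data ->> 'user_id' = '42';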

frithsun 3 months ago

Programmers would be better served by learning nothing except SQL instead of their current strategy of trying to learn everything except SQL.

  • dboreham 3 months ago

    They should learn about b-trees and how indexed queries can be done with them either with or without an explicit query language. Then they can decide what kind of data storage service they need. Understand what's happening inside the black box.

    • cryptonector 3 months ago

      Yes, for sure, though I'd still start with SQL.

  • dudeinjapan 3 months ago

    While I’d agree that understanding SQL basics is an important fundamental for novices to learn, I started using MongoDB 11 years ago and haven’t looked back.

karmakaze 3 months ago

What makes mongo mongo is its distributed nature, without it you could just store json(b) in an RDBMS.

  • rework 3 months ago

    > What makes mongo mongo is its distributed nature

    Since when? Mongo was popular because it gave the false perception it was insanely fast, until people found out it was only fast if you didn't care about your data, and the moment you ensured the write actually happened it ended up being slower than an RDB....

    • jokethrowaway 3 months ago

      Since forever. Sharding/distributing postgres or mysql was not easy. There were a few proprietary extensions. Nowadays it's more accessible.

      This was typical crap you had to say to pass a FAANG-style interview: "oh of course I'd use mongo because this use case doesn't have relations and because it's easy to scale", while you know postgres will give you way less problems and allow you to make charts and analytics in 30m when finance comes around.

      I made the mistake of picking mongo for my own startup, because of propaganda coming from interviewing materials and I regretted it for the entire duration of the company.

      • threeseed 3 months ago

        > Nowadays it's more accessible

        Distributing PostgreSQL still requires proprietary extensions.

        The most popular is Citus, which is owned by Microsoft, so questions should definitely remain about how long they will support it instead of pushing users to Azure.

        People like to bash MongoDB but at least they have a built-in, supported and usable HA/Clustering solution. It's ridiculous to not have this in 2024.

        • foobarkey 3 months ago

          Trying to use MongoDB by default for new projects because of the built in HA, dumped Postgres because the HA story is so bad on bare metal (alright if you are ok burning money on RDS or similar).

          Current preference:

          1. HA MongoDB
          2. HA MariaDB (Galera) or MySQL Cluster
          3. Postgres Rube Goldberg Machine HA with Patroni
          4. No HA Postgres

  • richwater 3 months ago

    > store json(b) in an RDBMS

    I actually did this for a small HR application and it worked incredibly well. JSONB GIN indexes are pretty nice once you get the hang of the syntax.

    And then, you also have all the features of Postgres as a freebie.
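
    For anyone curious, the basic shape is something like (made-up names):

        CREATE INDEX docs_data_gin ON docs USING gin (data jsonb_path_ops);

        -- containment queries like this can use the GIN index
        SELECT * FROM docs WHERE data @> '{"department": "HR"}';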

    • eddd-ddde 3 months ago

      Personally, I like postgres json syntax much better than whatever mongo invented.

      Big fan of jsonb columns.

      • oskar_dudycz 3 months ago

        I'm planning to add methods for raw JSON path or, in general, raw SQL syntax to enable such fine-tuning and not need to always use MongoDB API. I agree that for many people, this would be better.

  • anonzzzies 3 months ago

    So how easy is it to distribute it? I don’t have experience with it but the tutorials look terrible compared to, say, Scylla, Yuga, Cockroach, TiDB etc. Again, honest question?

    • tracker1 3 months ago

      Relatively easy... though, and I may be out of date, you have to choose replication or sharding at any given level... so if you want horizontal scale plus redundancy, you will wind up with slightly more complexity. My experience trying to recover a broken cluster after 5 hours of Azure going up/down in series was anything but fun.

      Would have to test, but the library for this post may well work with CockroachDB if you wanted to go that route instead of straight PostgreSQL. I think conceptually the sharding + replication of other DBs like Scylla/Cassandra and CockroachDB is a bit more elegant and easier to reason with. Just my own take though.

      • karmakaze 3 months ago

        I don't see any point in emulating MongoDB (NoSQL) document store on CockroachDB (NewSQL) which is already sharded and replicated.

        • redwood 3 months ago

          Right if you want to use a sharded and replicated system with a document data model, might as well use the real thing rather than introduce the risk and impedance mismatch of an emulation layer. The whole point here is for folks to not have to have the cognitive and performance downsides of a tabular data model

    • rad_gruchalski 3 months ago

      Pongo seems to be a middleware between your app and Postgres. So it will most certainly work absolutely fine on YugabyteDB, if one’s okay with occasional latency issues.

      One could optimise it more for a distributed sql by implementing key partition awareness and connecting directly to a tserver storing the data one’s after.

    • theteapot 3 months ago

      Does "distributed" mean sharded or just replicated? In either case it's a bit quirky but easy enough.

      > Scylla, Yuga, Cockroach, TiDB etc.

      You have experience "distributing" all these DBs? That's impressive.

      • anonzzzies 3 months ago

        We evaluated these over the years for different purposes/clients. And I don’t expect the ‘quirky’ part in 2024 I guess; it gets hard of course when the loads get specific; but in general I expect things to be automatic (multi master, sharding and replication) and simple to set up. Point out other nodes and done.

    • Too 3 months ago

      It couldn’t be simpler. Just configure all clients and servers to be aware of each other and go.

      Say what you want about the rest of mongo. This is an area where it actually shines.

      • anonzzzies 3 months ago

        Alright, I'll try it. Maybe I had the wrong tutorials or maybe these were old.

  • darby_nine 3 months ago

    but then you wouldn't have the joy of using the most awkward query language invented by mankind

  • zihotki 3 months ago

    But RDBMS'es are often also distributed. So what is mongo now?

    • marcosdumay 3 months ago

      People don't usually distribute Postgres (unless you count read replicas and cold HA replicas). But well, people don't usually distribute MongoDB either, so no difference.

      In principle, a cluster of something like Mongo can scale much further than Postgres. In practice, Mongo is full of issues even before you replicate it, and you are better off with something that abstracts a set of incoherent Postgres (or sqlite) instances.

      • zozbot234 3 months ago

        Postgres supports foreign data wrapper (FDW), which is the basic building block for a distributed DB. It doesn't support strong consistency in distributed settings as of yet, although it does provide two-phase commit which could be used for such.

        • williamdclt 3 months ago

          > strong consistency in distributed settings

          I doubt it ever will. The point of distributing a data store is latency and availability, both of which would go down the drain with distributed strong consistency

      • hibikir 3 months ago

        I think of the Stripe Mongo install, as it was a decade or so ago. It really was sharded quite wide, and relied on all shards having multiple replicas, as to tolerate cycling through them on a regular basis. It worked well enough to run as a source of truth for a financial company, but the database team wasn't small, dedicated to keeping all that machinery working well.

        Ultimately anyone doing things at that scale is going to run a small priesthood doing custom things to keep the persistence layer humming, regardless of what the underlying database is. I recall a project abstracting over the Mongo API, so as to allow for swapping the storage layer if they ever needed to.

    • brabel 3 months ago

      Often?? In my experience it's really hard to do it and still maintain similar performance, which kind of voids any benefit you may be looking for.

    • otabdeveloper4 3 months ago

      Postgres doesn't do replication correctly.

      Mongo does. (In fact, replication is about the only thing Mongo does correctly.)

      If you actually want a replicated log then Mongo is a very good choice. Postgres isn't.

  • lkdfjlkdfjlg 3 months ago

    > What makes mongo mongo is its distributed nature, without it you could just store json(b) in an RDBMS.

    Welllllllll I think that's moving the goalposts. Being distributed might be a thing _now_ but I still remember when it was marketed as the thing to have if you wanted to store unstructured documents.

    Now that Postgres also does that, you're marketing Mongo as having a different unique feature. Moving the goalposts.

    • thfuran 3 months ago

      It doesn't really seem reasonable to accuse someone of moving goalposts that you've just brought into the conversation, especially when they were allegedly set by a third party.

      • coldtea 3 months ago

        Parent didn't "just bring them", they merely referenced the pre-existing goalposts used to advocate for Mongo and reasons devs adopted it.

        • lkdfjlkdfjlg 3 months ago

          Exactly this, very eloquent, thank you.

          Yes, I'm still bitter because I was one of those tricked into it.

mvsdiego 3 months ago

Nice.

Oracle has a similar library-based documents/collections API named SODA, which has been around for years:

https://docs.oracle.com/en/database/oracle/simple-oracle-doc...

There are separate drivers for Java, node.js, python, REST, etc.

In addition to that, it has Mongo API, which is fully Mongo compatible - you can use standard Mongo tools/drivers against it, without having to change Mongo application code.

Both are for Oracle Database only, and both are free.

salomonk_mur 3 months ago

What would be the advantage of using this instead of simple jsonb columns?

  • joshmanders 3 months ago

    It uses JSONB under the hood. Just gives you a very "mongo" feel to using PostgreSQL. Not sure how I feel about it.

        CREATE TABLE IF NOT EXISTS %I (_id UUID PRIMARY KEY, data JSONB)
    • wood_spirit 3 months ago

      Can they make it use uuid7 for ids for better insert_becomes_append performance?

      • lgas 3 months ago

        Yes

        • oskar_dudycz 3 months ago

          Yes, I'm using JSONB underneath and translating the MongoDB syntax to native queries. As they're not super pleasant to deal with, I thought it'd be nice to offer the MongoDB API that's familiar to many.

          Regarding IDs, you can use any UUID-compliant format.

  • lopatin 3 months ago

    jsonb isn't web scale. Mongo is web scale.

    • digger495 3 months ago

      I see what you did there

  • imnotjames 3 months ago

    Looks like it matches the mongo node API

aussieguy1234 3 months ago

I'd like to know why AWS went with Aurora DB for their DocumentDB backend. Did the Mongo license change trigger a rush to build something Mongo compatible, but not quite MongoDB?

sberder 3 months ago

This looks great, I'll definitely give it a try. As many mentioned already, having classic columns and a JSON(B) column seems to be a common solution. How do you handle data validation for the JSON documents? My current project uses Django for metadata. I've been thinking about creating a layer similar to model fields in Django. You would declare a JSON "model" through those fields and assign it to the actual model JSON field.

  • throwaway76324 3 months ago

    You can just specify a model/DTO object and serialize it as JSON when saving. Many frameworks do that automatically so you don't need to think about it. At work we just annotate the field in the model as a json-field, and the framework will handle the json-conversion automatically and store the other fields in the model as regular database columns.

    pseudo code (to not trigger language wars):

       class Foo {
       
           @Id
           UUID id;
    
           String name;
    
           @Json
           MyCustomModel model;
       }
    
    Adding fields is not an issue, as it will simply be missing a value when de-serializing. Your business logic will need to handle its absence, but that is no different than using MongoDB or "classic" table columns
    • sberder 3 months ago

      That's a very low cost approach, I love it! I still think the Django ecosystem would benefit from a standardized/packaged approach including migrations. I'll ponder a bit more

  • oskar_dudycz 3 months ago

    Thank you! I'm planning to add support for JSON Schema and run the validation upon insert/update operations.

314156 3 months ago

https://docs.oracle.com/en/database/oracle/mongodb-api/mgapi...

Oracle database has had a MongoDB compatible API for a few years now.

  • cyberpunk 3 months ago

    And it only costs 75k a seat per year per developer, with free biyearly license compliance audits, a million in ops and hardware to get near prod, and all the docs paywalled. What a deal!

    • slau 3 months ago

      A client had a DB hosted by Oracle. The client was doing most of their compute on AWS, and wanted to have a synchronised copy made available to them on AWS. Oracle quoted them a cool $600k/year to operate that copy, with a 3 year contract.

      DMS + Postgres did it for $5k/year.

      • cyberpunk 3 months ago

        Client of mine wanted to shift a rac cluster from some aging sparc gear into a VMware or openstack or whatever farm they had on premise; oracle demanded they pay CPU licenses for every single CPU in the cluster as each one could “potentially” run the oracle database, quoted them seven figures.

        They rewrote the app instead.

vmfunction 3 months ago

Hmmm how does this compare to https://www.ferretdb.com ?

How does this handle large files? Is it enough to replace GridFS? One of the main attractions of MongoDB is its handling of large files.

pipe_connector 3 months ago

MongoDB has supported the equivalent of Postgres' serializable isolation for many years now. I'm not sure what "with strong consistency benefits" means.

  • zihotki 3 months ago

    Or is it? Jepsen reported a number of issues like "read skew, cyclic information flow, duplicate writes, and internal consistency violations. Weak defaults meant that transactions could lose writes and allow dirty reads, even downgrading requested safety levels at the database and collection level. Moreover, the snapshot read concern did not guarantee snapshot unless paired with write concern majority—even for read-only transactions."

    That report (1) is 4 years old; many things could have changed. But so far, every reviewed version has been faulty with regard to consistency.

    1 - https://jepsen.io/analyses/mongodb-4.2.6

    • pipe_connector 3 months ago

      Jepsen found a more concerning consistency bug than the above results when Postgres 12 was evaluated [1]. Relevant text:

      We [...] found that transactions executed with serializable isolation on a single PostgreSQL instance were not, in fact, serializable

      I have run Postgres and MongoDB at petabyte scale. Both of them are solid databases that occasionally have bugs in their transaction logic. Any distributed database that is receiving significant development will have bugs like this. Yes, even FoundationDB.

      I wouldn't not use Postgres because of this problem, just like I wouldn't not use MongoDB because they had bugs in a new feature. In fact, I'm more likely to trust a company that is paying to consistently have their work reviewed in public.

      1. https://jepsen.io/analyses/postgresql-12.3

    • endisneigh 3 months ago

      That’s been resolved for a long time now (not to say that MongoDB is perfect, though).

      • nick_ 3 months ago

        I just want to point out that 4 years is not a long time in the context of consistency guarantees of a database engine.

        I have listened to Mongo evangelists a few times despite my skepticism and been burned every time. Mongo is way oversold, IMO.

  • throwup238 3 months ago

    > I'm not sure what "with strong consistency benefits" means.

    "Doesn't use MongoDB" was my first thought.

  • danpalmer 3 months ago

    MongoDB had "strong consistency" back in 2013 when I studied it for my thesis. The problem is that consistency is a lot bigger space than being on or off, and MongoDB inhabited the lower classes of consistency for a long time while calling it strong consistency which lost a lot of developer trust. Postgres has a range of options, but the default is typically consistent enough to make most use-cases safe, whereas Mongo's default wasn't anywhere close.

    They also had a big problem trading performance and consistency, to the point that for a long time (v1-2?) they ran in default-inconsistent mode to meet the numbers marketing was putting out. Postgres has never done this, partly because it doesn't have a marketing team, but again this lost a lot of trust.

    Lastly, even with the stronger end of their consistency guarantees, and as they have increased their guarantees, problems have been found again and again. It's common knowledge that it's better to find your own bugs than have your customers tell you about them, but in database consistency this is more true than normal. This is why FoundationDB are famous for having built a database testing setup before a database (somewhat true). It's clear from history that MongoDB don't have a sufficiently rigorous testing procedure.

    All of these factors come down to trust: the community lacks trust in MongoDB because of repeated issues across a number of areas. As a result, just shipping "strong consistency" or something doesn't actually solve the root problem, that people don't want to use the product.

    • pipe_connector 3 months ago

      It's fair to distrust something because you were burned by using it in the past. However, both the examples you named -- Postgres and FoundationDB -- have had similar concurrency and/or data loss bugs. I have personally seen FoundationDB lose a committed write. Writing databases is hard and it's easy to buy into marketing hype around safety.

      I think you should reconsider your last paragraph. MongoDB has a massive community, and many large companies opt to use it for new applications every day. Many more people want to use that product than FoundationDB.

      • daniel-grigg 3 months ago

        Can you elaborate on why ‘many large companies’ are choosing MongoDB over alternatives and what their use cases are? I’ve been using Mdb for a decade and with how rich the DB landscape is for optimising particular workloads I just don’t see what the value proposition for Mdb is compared to most of them. I certainly wouldn’t use it for any data intensive application when there’s other fantastic OLAP dbs, nor some battle hardened distributed nodes use case, so that leaves a ‘general purpose db with very specific queries and limited indexes’. But then why not just use PG, as others say?

      • azurelake 3 months ago

        I’d be curious to hear more detail about the FoundationDB data loss issue that you saw? Do you remember what version / what year that you saw it?

    • nijave 3 months ago

      Have you looked at versions in the last couple years to see if they've made progress?

      • danpalmer 3 months ago

        This kinda misses my point. By having poor defaults in the past, marketing claims at-odds with reality, and being repeatedly found to have bugs that reduce consistency, the result is that customers have no reason to trust current claims.

        They may have fixed everything, but the only way to know that is to use it and see (because the issue was trusting marketing/docs/promises), and why should people put that time in when they've repeatedly got it wrong, especially when there are options that are just better now.

        • nijave 3 months ago

          Right, I was curious if you put even more time in :)

          I see lots of comments from people insisting it's fixed now but it's hard to validate what features they're using and what reliability/durability they're expecting.

    • throwaway2037 3 months ago

          > my thesis
      
      Can you share a link? I would like to read your research.
  • Izkata 3 months ago

    > MongoDB has supported the equivalent of Postgres' serializable isolation for many years now.

    That would be the "I" in ACID

    > I'm not sure what "with strong consistency benefits" means.

    Probably the "C" in ACID: Data integrity, such as constraints and foreign keys.

    https://www.bmc.com/blogs/acid-atomic-consistent-isolated-du...

  • lkdfjlkdfjlg 3 months ago

    > Pongo - Mongo but on Postgres and with strong consistency benefits.

    I don't read this as saying it's "MongoDB but with...". I read it as saying that it's Postgres.

  • jokethrowaway 3 months ago

    Have you tried it in production? It's absolute mayhem.

    Deadlocks were common; it uses a system of retries if the transaction fails; we had to disable transactions completely.

    Next step is either writing a writer queue manually or migrating to postgres.

    For now we fly without transactions and fix the occasional concurrency issue.

    • pipe_connector 3 months ago

      Yes, I have worked on an application that pushed enormous volumes of data through MongoDB's transactions.

      Deadlocks are an application issue. If you built your application the same way with Postgres you would have the same problem. Automatic retries of failed transactions with specific error codes are a driver feature you can tune or turn off if you'd like. The same is true for some Postgres drivers.

      If you're seeing frequent deadlocks, your transactions are too large. If you model your data differently, deadlocks can be eliminated completely (and this advice applies regardless of the database you're using). I would recommend you engage a third party to review your data access patterns before you migrate and experience the same issues with Postgres.

      • akoboldfrying 3 months ago

        >Deadlocks are an application issue.

        Not necessarily, and not in the very common single-writer-many-reader case. In that case, PostgreSQL's MVCC allows all readers to see consistent snapshots of the data without blocking each other or the writer. TTBOMK, any other mechanism providing this guarantee requires locking (making deadlocks possible).

        So: Does Mongo now also implement MVCC? (Last time I checked, it didn't.) If not, how does it guarantee that reads see consistent snapshots without blocking a writer?

        • devit 3 months ago

          Locking doesn't result in deadlocks, assuming that it's implemented properly.

          If you know the set of locks ahead of time, just sort them by address and take them, which will always succeed with no deadlocks.

          If the set of locks isn't known, then assign each transaction an increasing ID.

          When trying to take a lock that is already held: if the lock owner has a higher ID, signal it to terminate and retry after waiting for this transaction to terminate, and sleep waiting for it to release the lock.

          Otherwise, if it has a lower ID, abort the current transaction, wait for the conflicting transaction to finish and then retry it.

          This guarantees that all transactions will terminate as long as each would terminate in isolation and that a transaction will retry at most once for each preceding running transaction.

          It's also possible to detect deadlocks by keeping track of which thread every thread is waiting for, and signaling either the highest transaction ID in the cycle or the one the lowest ID is waiting for to abort, wait for the transaction it was waiting on to terminate, and retry.

          • akoboldfrying 3 months ago

            Yes, I'm aware that deadlock can be avoided if the graph having an edge uv whenever a task tries to acquire lock v while already holding lock u is acyclic, and this property can either be guaranteed by choosing a total order on locks and then only ever acquiring them in this order, or dynamically maintained by detecting tasks that potentially violate this order and terminating them, plus retries.

            However, those techniques apply only to application code where you have full control over how locks are acquired. This is generally not the case when feeding declarative SQL queries to a DBMS, part of whose job is to decide on a good execution plan. And even in application code, assuming a knowledgeable programmer, they need to either know about all locks in the world or run complex and expensive bookkeeping to detect and break deadlocks.

            The fundamental problem is that locks don't compose the way other natural CS abstractions (like, say, functions) do: https://stackoverflow.com/a/2887324

        • pipe_connector 3 months ago

          MongoDB (via WiredTiger) has used MVCC to solve this problem since transactions were introduced.

    • threeseed 3 months ago

      > Next step is either writing a writer queue manually

      You can just use a connection pool and limit writer threads.

      You should be using one to manage your database connections regardless of which database you are using.

DonnyV 3 months ago

Would love a C# version of this. I usually use Mongodb for all of our projects. But we need to use Postgres for a project. This would come in very handy.

  • oskar_dudycz 3 months ago

    You can check Marten, which I was co-maintaining: https://martendb.io/. It doesn't have a MongoDB-compliant API, but it's mature, stable and efficient.

deanCommie 3 months ago
  • oskar_dudycz 3 months ago

    Yes, a similar idea, but I don't aim to be 100% MongoDB compliant or a full replacement. My goal is to use as many PostgreSQL features as possible. Having the translation at the library level will allow more scenarios, e.g. sharing connections and using PostgreSQL hosting.

tracker1 3 months ago

Has this been tested with CockroachDB or any other databases that use a mostly compatible PostgreSQL wire protocol and query language?

Tao3300 3 months ago

Ditch the dalmatian before Disney rips your face off.

rework 3 months ago

Looks sort of like MartenDB but trying to mimic the mongo api, unsure why anyone would want to do that... mongo api is horrible...

  • JanSt 3 months ago

    Wouldn't that allow to switch from Mongo to Postgres without having to rewrite all of your app?

    • oskar_dudycz 3 months ago

      Hint: I'm an ex-Marten maintainer, so the similarity is not accidental ;)

      As OP said, not needing to rewrite applications, and keeping the muscle memory from using Mongo, is beneficial. I'm not planning to be strict and support only MongoDB API; I will extend it when needed (e.g. to support raw SQL or JSON Path). But I plan to keep the shim's API compliant for the above reasons.

      MongoDB API has its quirks but is also pretty powerful and widely used.

      • rework 3 months ago

        Oh, so you are, then we can rest assured this will end up being a solid project!

        I personally can't stand mongodb, it's given me a lot of headaches. I joined a company and the same week I joined we lost a ton of data, and the twat who set it up resigned in the middle of the outage. Got it back online and spent 6 months moving to postgresql.

        • oskar_dudycz 3 months ago

          Thanks, that's the goal: to bring the solid and verified approach in Marten to Node.js land. The concept is similar, but the feature set will be different.

ramchip 3 months ago

Have you tried it with CockroachDB?

  • oskar_dudycz 3 months ago

    I did not, but I'm not using any fancy syntax so far besides JSONB operators. If it won't work, then I'm happy to adjust it to make it compliant.

joeyagreco 3 months ago

Good work! I would like to see a section on the README outlining the benefits of Pongo

  • oskar_dudycz 3 months ago

    Thanks, I'll try to cover that, good call!

posix_monad 3 months ago

Does MongoDB have serious market share compared to DynamoDB (and similar clones from Azure, GCP) at this point?

  • maxdo 3 months ago

    Mongodb and dynamodb are completely different dbs. One is an unlimited-scale KV store but very expensive; the other is a document nosql db that sells you the idea “it just works” for lots of features: indexes on anything, aggregation, time series, vector db, sharding, replicas, etc. It’s a very powerful db for sure.