Introducing udata 4, codename: cockroach — you'll see why in a bit!
This is a follow-up to the roadmap post for our latest major release, udata 3.
Our goal here is to highlight the major changes coming with udata 4 and what we think can be built with it.
It's intended for anyone looking at udata for their needs or already using it. Please be aware, it can get quite technical.
We'll also talk about the community around udata and how we hope to share and communicate in the future.
Since the release of udata 3, aka snowpiercer, we mainly focused on the development of our new frontend, which was necessary for the graphical overhaul of data.gouv.fr.
We believe this new system is now stable and brought us closer to our ultimate goal, as previously stated: udata
should focus on providing an API and business logic around an open data catalog. By moving (almost) everything frontend-related to udata-front, we made a big step in this direction. As planned, we iterated a lot on this graphical redesign, and we did so swiftly.
Now that the frontend architecture is out of the way, we can focus on new components, following the same approach: services around udata
rather than a fat monolith. Those services are (so far):
If you're a udata user, you should know that, moving forward, you'll either:
Our search engine has long been a frustrating part for the users of data.gouv.fr. Searching through 40k+ datasets with often poor metadata can be a daunting experience and is a true technical challenge.
That's why we decided to rebuild our search engine from the ground up. We previously relied on an old version of Elasticsearch, tightly coupled to a lot of udata code. Our goals were to:
In order to do that, we built udata-search-service. It's responsible for managing the search index: (de)indexing search metadata for every object and sending back search results given a query.
You'll notice Apache Kafka has been introduced in this architecture. It acts as a broker between udata and udata-search-service for indexation purposes.
While using Apache Kafka is largely overkill for this scenario, we believe that moving toward an event-based model through Kafka will allow us to build powerful services around udata. You can probably also see now why this release has been dubbed cockroach :-)
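To make the broker idea concrete, here is a minimal sketch of what an indexation event flowing from udata to udata-search-service could look like. The topic name, message shape, and field names below are our assumptions for illustration, not udata's actual wire format:

```python
import json

def build_index_event(object_type, object_id, document):
    """Build a hypothetical (de)indexation message for the search service."""
    return {
        "service": "search",
        "type": object_type,    # e.g. "dataset", "reuse", "organization"
        "id": object_id,
        "document": document,   # metadata to index, or None to deindex
    }

event = build_index_event(
    "dataset", "5de8f397634f4164071119c5", {"title": "Population by region"}
)
payload = json.dumps(event).encode("utf-8")

# With a Kafka client such as kafka-python, udata would then publish it:
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("dataset", payload)
```

On the other side, udata-search-service consumes these messages and updates its index accordingly, which keeps the catalog and the search index decoupled.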
We made a distinction between list endpoints and smart full-text search endpoints. Existing list endpoints in API v1 (/api/1/datasets) still exist but are now backed by MongoDB queries. This is the case for organizations, datasets, reuses and users. Some parameters have been dropped, but a q parameter is still supported.
The smart full-text search is now optional and is enabled by setting SEARCH_SERVICE_API_URL. It is defined in API v2 with the following URL pattern: /{object}/search (e.g. /datasets/search). This endpoint forwards an HTTP request, with the query and parameters, to a remote server. You can easily plug in your own full-text search engine API here, or use our own udata-search-service!
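From a client's perspective, the split above means picking one of two endpoints depending on whether a search service is configured. The helper below is our own sketch (the exact v2 path prefix may differ in practice); only the SEARCH_SERVICE_API_URL setting and the two endpoint families come from the post:

```python
def search_url(base, obj, query, search_service_api_url=None):
    """Return the URL to query for `obj` ("datasets", "organizations", ...).

    If a search service is configured, use the API v2 full-text endpoint;
    otherwise fall back to the API v1 Mongo-backed list endpoint.
    """
    if search_service_api_url:
        return f"{base}/api/2/{obj}/search/?q={query}"
    return f"{base}/api/1/{obj}/?q={query}"

# With full-text search enabled:
url = search_url("https://www.data.gouv.fr", "datasets", "velo",
                 search_service_api_url="http://search:5000")
# → https://www.data.gouv.fr/api/2/datasets/search/?q=velo
```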
Finally, the suggest endpoints now use a MongoDB contains query instead of smart full-text search. The user suggest endpoint has been dropped.
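A "contains" suggest can be expressed as a case-insensitive MongoDB regex filter. The field name and the exact query shape below are our assumptions for illustration:

```python
import re

def suggest_filter(text, field="title"):
    """Hypothetical 'contains' filter: case-insensitive substring match."""
    # re.escape prevents user input from being interpreted as a regex
    return {field: {"$regex": re.escape(text), "$options": "i"}}

q = suggest_filter("vélo")
# With pymongo, this could be used as: db.datasets.find(q).limit(10)
```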
One possibility we're very excited about with this new architecture is being able to analyze what's going on in the data we host or link to. Up until now, we almost exclusively focused on metadata, and we believe it's time to dig deeper!
We're still brainstorming and experimenting around this, but the architecture could look something like the diagram below. As you can see, Kafka becomes a primary citizen and distributes events to and from multiple services:
All those services shall communicate through Kafka and enrich whichever service is interested. We're also thinking about communicating with external systems (i.e. services outside our architecture) through a REST gateway.
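The publish/subscribe idea behind this can be sketched with a toy in-process dispatcher, where `subscribe` and `publish` stand in for Kafka consumer groups and topics. The topic and event names here are illustrative, not udata's actual ones:

```python
from collections import defaultdict

class EventBus:
    """Toy stand-in for Kafka: services subscribe to topics, receive events."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

bus = EventBus()
indexed = []
# e.g. the search service would subscribe to dataset events for (de)indexation:
bus.subscribe("dataset.updated", indexed.append)
bus.publish("dataset.updated", {"id": "abc123", "title": "Air quality"})
```

The appeal of this model is that a new service (quality checks, analytics, a REST gateway to external systems) only needs to subscribe to the topics it cares about, without the other services knowing it exists.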
As stated, we think of the new event-driven architecture as an enabler for future services and features. Thinking long term, it might even be a way to integrate external systems more seamlessly with udata or data.gouv.fr.
The last time we wrote on udata, regarding the v3 release, we highlighted the fact that Etalab was pretty much alone in maintaining udata.
An overwhelming majority of commits on udata have been made by members of Etalab, the French government agency responsible for data.gouv.fr.
The technical team at Etalab, responsible not only for development but also for operating the platform and handling user support, has always been relatively small: between one and three people.
We had also lost touch with the main platforms reusing udata for their needs, leading us to focus on our own needs first rather than the community's.
Since then, we are happy to have renewed contact with some udata users, mainly the governments of Serbia and Luxembourg. We sensed an increasing need for help and documentation, mainly around migrating to the latest version of udata.
We heard this, and we will try to produce more documentation and give more visibility into our technical choices, hence this post. We have also revamped our main documentation quite a bit, including a much-needed section on the new Matomo-based metrics collection system.
If you're interested in udata 4, you can join the community to discuss architectural changes or get started with udata 4 at https://github.com/opendatateam/udata/discussions/2724.
We believe we need a vibrant and responsive community for those efforts to be successful. It can be pretty frustrating to make these efforts in the dark.
That's why we need your help! Our first step is opening GitHub discussions on the udata project. If you're using or are interested in udata, please come and comment on this discussion. Tell us how you're using it, what you're expecting from the community, and what you're willing to give back... Also feel free to open a new discussion if you want to reach out about something broader than a specific udata technical issue (plugins, governance, migration, installation...).