
It was fixed with a workaround: newer versions copy a statically compiled Node.js binary to the server, and it works.


> if you want to convert some scheduled pipeline to some event-driven architecture

Airflow has sensors and triggers. https://airflow.apache.org/docs/apache-airflow/stable/author...

But at its core it is built around the data pipeline concept; an event-driven pipeline will be much more fragile. Airflow intentionally doesn't manage business logic, it works with "tasks".
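To make the sensor idea concrete: a sensor's check is ultimately just a callable returning True/False that Airflow re-invokes on an interval (or defers via triggers). A minimal sketch of such a poke function, testable without Airflow; wrapping it in `PythonSensor` and the marker-file path are illustrative assumptions, not code from the thread.

```python
import os

def wait_for_marker(path: str) -> bool:
    """Return True once the upstream system has dropped its marker file."""
    return os.path.exists(path)

# Inside a DAG this would be wrapped roughly as:
#   PythonSensor(task_id="wait_for_upstream",
#                python_callable=wait_for_marker,
#                op_args=["/data/incoming/_SUCCESS"],
#                poke_interval=60)
```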


Yes, but that means you are forced to build EDA on top of Airflow, which may not be ideal in many cases. You are stuck managing your pools/workers within Airflow's paradigm, which means every workload must (a) be written in Python, (b) have Airflow installed in the venv (a very heavy package), and (c) run as a k8s pod or under Celery (unless you write your own).


Only because you have chosen to introduce configuration and maintenance complexity by using Airflow as enterprise-wide middleware.

In a modern event-based SOA, products like Airflow are a sometimes food while pub/sub is the default.

Perhaps a search for images of the Zachman Framework would help conceptualize how you are tightly coupling to the implementation.

But also research SOA 2.0, or event-based SOA; the Enterprise Service Bus concept of the original SOA is as dead as CORBA.

ETA: the minimal package load for Airflow isn't bad; are you installing all of the plugins and their dependencies?


We only use KubernetesOperators, but this has many downsides, and it's very clearly an afterthought in the Airflow project. It creates confusion because users of Airflow expect features A, B, and C, and when using KubernetesOperators those aren't functional because your business logic is separated. E.g., if your business logic knows which S3 bucket it talks to in an external task, how can Airflow? So now its Dataset feature is useless.

There are a number of blog posts echoing a similar critique[1].

Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and makes Airflow as a whole a pretty overkill system just to monitor external tasks. At that point, you should have just had your orchestration in client code to begin with; many other frameworks make this correct division between client and server. That would also make it easier to support multiple languages.

According to their README: https://github.com/apache/airflow#approach-to-dependencies-o...

> Airflow has a lot of dependencies - direct and transitive

> The important dependencies are: SQLAlchemy, Alembic, Flask, werkzeug, celery, kubernetes

Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?

[1] Anecdotes from various posts - https://medium.com/bluecore-engineering/were-all-using-airfl... - https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-co... - https://dagster.io/blog/dagster-airflow

> Airflow, in its design, made the incorrect abstraction by having Operators actually implement functional work instead of spinning up developer work.

> By simply moving to using a Kubernetes Operator, Airflow developers can develop more quickly, debug more confidently, and not worry about conflicting package requirements.

> Airflow lacks proper library isolation. It becomes hard or impossible to do if any team requires a specific library version for a given workflow

> There is no way to separate DAGs to development, staging, and production using out-of-the-box Airflow features. That makes Airflow harder to use for mission-critical applications that require proper testing and the ability to roll back

> Data pipelines written for Airflow are typically bound to a particular environment. To avoid dependency hell, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictates that the task gets executed in Kubernetes. When a DAG is written in this way, it’s nigh-impossible to run it locally or as part of CI. And it requires opting out of all of the integrations that come out-of-the-box with Airflow.


Airflow is far from perfect, but I don't understand your concerns. I work in a big and messy company and an even messier department. We have jobs running in Databricks and Snowflake; sometimes we read data from API endpoints, or even files uploaded to SharePoint (my group is not building the DW). Airflow lets me organize it all in a single workflow. At least I know that every failed job is reported by email, and I don't need to search multiple systems - everything starts from Airflow.

> Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?

The webserver is mostly UI; the scheduler service triggers the jobs.

We have groups which run everything as a BashOperator; no dependency issues that way.

Maybe you have a very specific use case in mind. The main points of using Airflow for me:

* Single orchestration center: manual job control (stop, pause, rerun), backfill; automated scheduler/retry; built-in notification

* Framework built around a "reporting period" - it enforces the correct abstraction: if a data batch is broken, I can rerun it and rerun all dependent downstream tasks. How do you fix data in an event-driven workflow?

* Managing dependencies

In most cases all Airflow does is run your job, passing it a "date" parameter. You can test your code without Airflow - just pass it a date and run it from the command line.
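That testing pattern can be sketched as follows: the task is a plain function of the logical date, and a tiny CLI wrapper lets you run it without Airflow. `process_batch`, the file name, and the date format are assumptions for illustration, not code from the thread.

```python
import argparse
import datetime
import sys

def process_batch(batch_date: datetime.date) -> str:
    # Stand-in for the real business logic, e.g. loading one day's partition.
    return f"processed partition for {batch_date.isoformat()}"

if __name__ == "__main__" and len(sys.argv) > 1:
    parser = argparse.ArgumentParser()
    parser.add_argument("date", help="logical date, YYYY-MM-DD")
    args = parser.parse_args()
    print(process_batch(datetime.date.fromisoformat(args.date)))
```

From Airflow you would pass `{{ ds }}` as the argument; from your shell, just `python task.py 2024-01-01`.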


There is a nice summary on the topic: https://aws.amazon.com/blogs/big-data/choosing-an-open-table... ("Optimizing read performance"). Those technologies primarily target "Data Management at Scale", but they also extend the capabilities of raw storage formats such as Parquet. So they may help you, but the question is whether you really need them. I haven't worked with BigQuery; it may include [similar features](https://cloud.google.com/bigquery/docs/search-index).

You need to define what "latency" means in your case and what "quite high levels" means. We are talking about analytical data storage; it is designed for efficient batch processing. Finding a single record is not a primary goal of the architecture - you will need some kind of caching/indexing for fast lookups. Sometimes adding "limit 1" to your single-record search may solve the problem.

Be sure you are using an efficient data storage format such as Parquet, check the size of the files to be sure you don't have the ["small file problem"](https://www.royalcyber.com/blog/data-services/managing-small...), then check that you are using the relevant BigQuery features. Before and after those checks, run "explain" on your query: if you don't use partition keys or indexed columns, your search results won't be instant in any big data system.
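A quick way to spot the small-file problem is to look at the average file size in a dataset directory. This is a minimal local sketch (the directory layout and the `.parquet` suffix are assumptions); many guides suggest targeting files in the hundreds-of-megabytes range rather than thousands of tiny ones.

```python
import os

def avg_file_size_bytes(root: str, suffix: str = ".parquet") -> float:
    """Average size of files ending in `suffix` under `root` (0.0 if none)."""
    sizes = [
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
        if name.endswith(suffix)
    ]
    return sum(sizes) / len(sizes) if sizes else 0.0
```

If the average comes out in the kilobytes, compaction (or fewer, larger write partitions) is usually the first fix to try.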


>pendulum is probably your best bet

It is the best until you have to work with a Pandas DataFrame (https://stackoverflow.com/questions/47849342/making-pandas-w...)


Numpy and pandas are their own little island. It's not just dates and time, it's everything.

If you use numpy and pandas, you should also not use Python datetime, generators, most stdlib mathematical functions, the itertools module, random, etc.

It's the first thing you learn if you read any good pandas book, and the first thing I teach in my numpy/pandas trainings.

It has pretty much nothing to do with pendulum.

Basically, half the Python ecosystem is "well, except with numpy/pandas of course".
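The "use numpy's equivalents, not the stdlib" point can be shown in two lines: the results agree, but the vectorized form runs in C over the whole array instead of calling a Python function per element. The array contents here are arbitrary illustration.

```python
import math

import numpy as np

arr = np.arange(1_000, dtype=np.float64)

# stdlib style: a Python-level loop, one math.sqrt call per element
looped = sum(math.sqrt(x) for x in arr)

# numpy style: one vectorized operation over the whole array
vectorized = np.sqrt(arr).sum()
```

The same substitution applies across the board: `np.random` for `random`, boolean masks for `itertools`-style filtering, `datetime64` for `datetime`.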


    from zoneinfo import ZoneInfo
    import datetime

    import pandas as pd
    import pendulum

    def pendulum_to_datetime(pendulum_dt: pendulum.DateTime) -> datetime.datetime:
        return datetime.datetime.fromtimestamp(
            pendulum_dt.timestamp(), ZoneInfo(pendulum_dt.timezone_name)
        )

    # test

    df = pd.DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
    df["time_column"] = pendulum_to_datetime(pendulum.now())

    print(df)
output

       a  b                      time_column
    0  1  2 2023-11-19 18:14:16.027777-05:00
    1  1  2 2023-11-19 18:14:16.027777-05:00


    >>> df.dtypes

    a                                         int64
    b                                         int64
    time_column    datetime64[ns, America/New_York]
    dtype: object


I doubt it was an .exe; probably an .msi (installer) or a zip file.


Not sure it is relevant; probably all the heavy lifting is done in C++ and Rust: https://searchfox.org/mozilla-central/search?q=render&path=&...



I think the article and most of the commenters are missing the real problem. The real problem is an underlying complexity with no simple solution: collaboration between teams/groups of people with different expertise, and even different interests, is a universally unsolved problem. When a company passes the "startup" stage and gets real customers, it isn't a startup anymore. It has more tasks to handle and needs more people, people with different expertise. There are sales, operations, developers (with multiple specialties), lawyers, coordinators (of all kinds). Innovation is not a self-evident target; at some point a startup needs to cash in on its innovation. There are frameworks which have attempted to address that complexity, for example Scrum, its more flexible superset the "agile manifesto", or even waterfall. They all may work or fail, but there is no guarantee, no recipe.


I use Java occasionally; there isn't any special learning curve if you know the basic syntax. Just as in Python you don't start with Django and virtual environments, in Java you shouldn't start with Spring Boot/Spring.

Google the library you need: "java http client".

https://openjdk.org/groups/net/httpclient/intro.html - it doesn't look complicated.

The first example uses "reactive-streams"; you don't have to use that syntax if you aren't familiar with it (just as Python has newer features which beginners may learn later). The other examples are straightforward.

If the library you need doesn't exist in the standard library, download the jar, add it to your `classpath`, and use it.

Skip any tutorials which use Spring (unless you want to learn Spring). If you want to write more advanced Java code, learn about [Maven](https://maven.apache.org/).

Same thing about JavaEE/JavaSE: if you google it, you will find that you may use either; different libraries are included in the installation, but that doesn't mean you can't add them later.


The term "toxic" is toxic in itself.

Periods when an engineer doesn't understand the problem should be spent on analysis of the problem domain. "Now I am working on defining the problem domain" is the activity that works the "I don't understand the problem" task. During that period, probably zero code will be written.

> That author only encounters problems that have been fully solved before

He doesn't; otherwise there would be no talk about "plan B" and risks. When you actively write project code, you should know that a solution is possible. Having a plan doesn't mean the "problems have been fully solved before". You may have a POC which doesn't end in resolution, but it should be clear what the POC is for, and failure is a possible outcome.

