Software lessons from distributed systems

This semester I worked on my first non-trivially sized software engineering course project, in ECE419, where a team of three students wrote a distributed key-value storage service in Java. Reflecting on the teamwork, I found several things important for software engineering teams:

  1. Remember that we make mistakes.
  2. Be open to different expectations.
  3. Workload balancing remains an open question.

1. Remember that we make mistakes

We are not ideal software engineers. The idealized software engineering paradigms in textbooks (e.g., Effective Java) teach all sorts of good practices that, if followed well, lead to good code. Here "good" could mean, for example, a lower chance of having to rewrite the code, or less time spent debugging when something fails unexpectedly. That being said, as 4th-year students, we are far from ideal software engineers. To be honest, our code would be roasted if scrutinized against the coding standards of major software companies. The imperfections in our code could come from (1) algorithms, (2) unfamiliar tools, and (3) designs. These three types of imperfection are listed in increasing order of how hard they are to eliminate.
The first type, algorithmic glitches, consists of errors that break the desired functionality of a software project. We should take every possible approach to eliminate them, e.g., thorough unit test coverage. In addition, frequent practice on LeetCode could help reduce algorithmic errors in the first place.
The second type, unfamiliar tools, could also lead to the meltdown of the whole system, so it should be eliminated as well. StackOverflow and official documentation are good helpers, and we should really make sure we know what every block of code does before typing it into our IDEs. In addition, noting down the versions of third-party packages provides grounds for adaptation in case an update changes some APIs (e.g., TensorFlow when it upgraded from 0.x to 1.0, and PyTorch from 0.4 to 1.0).
The third type, suboptimal designs, might not directly result in malfunctioning code, but poor design choices could lead to redundant and hard-to-maintain code. With my current level of software engineering experience, I can’t give definite criteria for which designs are poor. However, the following guidelines appear meaningful to me:

  • We should not create many more states than necessary. A larger number of states cranks up the difficulty of both coding and maintenance. For example, suppose we want to instantiate an object (in its constructor) that carries several fields, some of which might be left empty depending on the usage scenario. It might then be a good idea to iterate through the fields instead of through the possible use cases. Theoretically, for an object with $N$ fields, there could be $2^N$ use cases (each field is either empty or not), whereas traversing the fields involves only $N$ checks. By the way, relying on the default behavior of Gson (with some annotations) requires writing even less code.
  • We should not let lower-level objects make higher-level decisions. In robotics, this is the principle behind the “subsumption architecture”. In a road robot built with the subsumption architecture, lower-level modules (e.g., a path tracking module, which lets the robot follow existing paths) handle their own responsibilities, while higher-level modules (e.g., a path planning module, which decides which paths to take) “subsume” the lower-level ones. More importantly, when the path tracking module is running, it controls how the wheels operate and does not consider modifying the paths. This design methodology reduces the inter-dependencies between modules from a graph-like structure to a tree-like structure.
    Note that team members might disagree on these design principles. It is therefore necessary to talk over important decisions before writing the implementation.
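
The first guideline (traversing fields rather than use cases) can be sketched in Java as follows. This is only a minimal illustration under my own assumptions: the class name KVMessageSketch, its three fields, and the "name=value" output format are hypothetical and are not taken from our project code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical message type; the class and field names are illustrative only.
class KVMessageSketch {
    private final String key;     // may be null (empty)
    private final String value;   // may be null (empty)
    private final String status;  // may be null (empty)

    KVMessageSketch(String key, String value, String status) {
        this.key = key;
        this.value = value;
        this.status = status;
    }

    // Visit each field once and skip the empty ones:
    // N null checks instead of 2^N hand-written branches.
    String serialize() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("key", key);
        fields.put("value", value);
        fields.put("status", status);

        StringJoiner out = new StringJoiner(",");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() != null) {
                out.add(e.getKey() + "=" + e.getValue());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A message without a value and a message with all fields set.
        System.out.println(new KVMessageSketch("k1", null, "GET").serialize());
        System.out.println(new KVMessageSketch("k1", "v1", "PUT").serialize());
    }
}
```

Adding a fourth optional field to this sketch means adding one more entry to the map, whereas a use-case-driven constructor would double its number of branches.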

All three types of mistakes are more likely to occur when team members are in a passive “zone”. On one hand, the time of day heavily affects these “zones”. I remember a day when we stayed until 12am after grabbing some dinner. We spent 4 hours tracking down a stupid bug that we would otherwise have found within half an hour in the morning. On the other hand, working continuously tends to pull people from the “active zone” into the “numb zone”. Another incident I remember happened when two of us found a bug and were debugging it while the third member was out having dinner. When he came back, he thought through the bug much faster than the two of us (who had probably worked on the project almost non-stop for 8 hours).

2. Be open to different expectations

Nothing hurts more than an active team member quitting. The remaining people have to modify the task allocation and, if necessary, fix the problems left behind by the person who is gone. However, people quit team projects a lot in university courses. This is partly because the projects usually do not weigh as much as exams, and because completing them requires a significant time commitment.
To reduce the potential damage of these incidents, we should understand the expectations of every team member, even before the project starts. At a minimum, everyone should know the following about the other team members:

  • Amount of time they expect to spend on this project. Is it at most several hours per week, or will most of the time be concentrated near the deadline?
  • Expected quality of this project. Is it “good enough is good enough”, or “things should be done perfectly”? If a project milestone is optional, is time going to be spent on it or not?
  • Expected distribution of workloads among people. Namely, who is in charge of which functionality.

A lot of these expectations could be shaped by the ownership mentality. At a company I previously worked at (TripAdvisor), people emphasized that “you own your code”. The point of this emphasis is to give team members a feeling of ownership. The unspoken message behind “you own this code” is “you are leading the work on this block of code”. Considering that the team lead is usually the most motivated person, appointing someone to “lead” encourages their initiative to some extent. This strategy seems to work especially well for highly capable people (who hence have higher expectations for themselves).

Another factor that affects these expectations is team dynamics. Team members should feel comfortable working with each other. This requires us to avoid making other people feel disrespected. For many, this means respecting their time and effort, including complimenting their good ideas or decisions. Here are several counter-examples:

  • “Your work is not good. Let me do it.” This is basically a bomb because it can very easily be interpreted as an ad hominem (i.e., towards the person) attack. In this case, you likely won’t be able to work together any more.
  • “Can you do this and that? I just don’t want to.” This could be problematic, depending on the context. For example, the speaker might expect the listener to pick up all the remaining tasks easily, while the listener expects both people to spend approximately equal amounts of effort. Such an implicit mismatch of expectations could lead to a communication breakdown.
  • “This course / project is a piece of crap.” This type of sentence is not desirable, partly because it is criticism but not constructive criticism. It does not aim at solving problems or making the situation better; it aims at expressing feelings instead.

    Careful wording should be used when communicating expectations and building team dynamics.

3. How to balance the workload?

This seems like an open question to me. In university course projects, balancing the workload might seem like an easy task: there has to be one person carrying the team, and that person is basically going to write all the code. This type of “collaboration” is prohibited in ECE419, because the TAs ask every member arbitrary questions. More detailed questions about workload balancing include:

  • How should tasks be divided properly in the first place? Ideally, the workload should be divided into chunks such that everyone can complete their chunk with comparable quality at approximately the same time, and the chunks should not block one another. Perhaps this becomes possible if people have worked together for long enough and the manager is experienced enough?
  • When people do not finish their current chunks at about the same time, how should the remaining workload be adjusted dynamically?
  • Even if the ideal workload division is achieved, what if a design decision changes halfway (shifting to an easier solution, going with plan B, etc.)?

4. What makes a team well-functioning?

A good team enables people (who make mistakes at varying rates and hold distributed expectations) to work together happily towards a centralized goal.