In this short post I'll tell the story of how we debugged an issue that prevented us from starting a Hadoop YARN ResourceManager.
We had configured it to enable SSL/TLS and to read its keystore from `/opt/hadoop/keystore.jks`, which was owned by `hdfs:hadoop` with mode 640.
We started the ResourceManager as user `yarn`, who belongs to the group `hadoop`.
Everything should work, right? `keystore.jks` belongs to group `hadoop`, and members of that group, which we are, should be able to read it.
Instead we got a permission denied exception like this:

```
java.io.FileNotFoundException: /opt/hadoop/keystore.jks (Permission denied)
```
Of course we turned to Google to figure out what went wrong, and everyone was telling us the same thing: you need the proper permissions on every part of the path, so we also needed read and execute permissions on `/opt` and `/opt/hadoop`. Unfortunately, that was already the case.
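That "check every component of the path" advice can be sketched as a tiny Python helper (a quick sketch of our own; `check_path_access` is not from YARN or Hadoop, and note that `os.access` checks against the process's *real* uid/gid):

```python
import os

def check_path_access(path):
    """Walk every component of an absolute path and report whether the
    current process may read (r) and traverse (x) it. For the directories
    along the way, the x bit is what matters for reaching the file."""
    parts = [p for p in path.split("/") if p]
    current = "/"
    for part in parts:
        current = os.path.join(current, part)
        r = os.access(current, os.R_OK)
        x = os.access(current, os.X_OK)
        print(f"{current}: read={r} exec={x}")

check_path_access("/opt/hadoop/keystore.jks")
```

On our hosts a check like this reported read and exec on every component, which is exactly why the error was so puzzling.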
This post details all the steps we took to debug the issue and finally provides the solution.
Let's start with a quiz, about something I didn't know either but learned while investigating this issue.
Running this code:
```shell
mkdir test
chmod 777 test
cd test
touch foobar
chmod 700 foobar
cat foobar
chmod 070 foobar
cat foobar
```
What result would you expect, assuming `foobar` belongs to your own user and group?
I expected it to show me the result two times.
What really happens, however, is a permission denied error on the second `cat`:

```
cat: foobar: Permission denied
```
We did turn to the POSIX spec, but that (IMO) is not super clear. Wikipedia, however, gave us the answer:
> The effective permissions are determined based on the first class the user falls within in the order of user, group then others. For example, the user who is the owner of the file will have the permissions given to the user class regardless of the permissions assigned to the group class or others class.
Ah! So it checks my user class first, and since that class does not grant access, it doesn't even consider the group or others classes.
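That rule can be written out as a few lines of Python (a sketch of my own, ignoring root's override; `may_read` and all the numbers are made up for illustration):

```python
import stat

def may_read(mode, file_uid, file_gid, uid, gid, groups):
    """Sketch of the POSIX class-selection rule: exactly ONE class
    applies -- owner if the uid matches, else group if any of our gids
    match, else other -- and only that class's read bit is checked."""
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid == gid or file_gid in groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

# The quiz: chmod 070 on a file we own. The owner class is selected
# and has no read bit, so access is denied despite the group bits.
print(may_read(0o070, 1000, 1000, 1000, 1000, []))  # False
```

The group bits never even enter the picture once the owner class has matched, which is exactly what the quiz demonstrates.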
Anyway, back to the issue at hand.
We checked everything we could think of:
First we checked SELinux using `sestatus`, but it was disabled.
Next, we tried logging in as `yarn` and reading the file from a bash shell. Surprisingly, that worked. Weird.
Then we checked (using `ps`) whether the ResourceManager process actually runs as the proper user. At this point we were down to comparing the numeric uid and gid rather than relying on the names themselves. But... those also matched.
To recap: a Java process running as user `yarn` gets a `Permission denied` reading a file, while a bash shell logged in as `yarn` can read said file.
As the next step we ripped out the exact YARN code that accesses the file, which boils down to something like this:
```java
FileInputStream stream = new FileInputStream(file);
int firstByte = stream.read();
stream.close();
```
Which is not very interesting, but we ran it anyway. It worked!
Next step: attaching a debugger to the running ResourceManager process. jdb to the rescue, since we didn't have the access needed to attach a remote debugger from our own machines.
We attached to the ResourceManager and ran the following statements:
- `new File("/opt").canRead()` returns `true`
- `new File("/opt/hadoop").canRead()` returns `false` (WAT?)
- `System.getProperty("user.name")` returns `yarn`
As a reminder, this is what our directory structure looks like:
```
/opt                      root:root    drwxr-xr-x
/opt/hadoop               hdfs:hadoop  drwxr-xr-x
/opt/hadoop/keystore.jks  hdfs:hadoop  -rw-r-----
```
Okay, this is confusing. What's going on?
At this point we dug through the OpenJDK source code and ended up in the Linux kernel sources. Java, in the end, uses an `open` syscall, which returned `-1`.
In order to replicate this I've written a short C program which narrows it down as much as possible:
```c
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
    printf("%d\n", access(argv[1], R_OK));
    printf("%s\n", strerror(errno));
    printf("%d\n", open(argv[1], O_RDONLY));
    printf("%s\n", strerror(errno));
}
```
Surprise: This also worked perfectly.
Back to the debugger. This time we created a simple shell script that did nothing but echo the results of `id -Gn` and `id yarn` to a file. We put that shell script in `/var/lib/hadoop-yarn/testscript.sh`, made it executable and owned by `yarn`.
Then we went back to `jdb` to execute this piece:

```
print Runtime.getRuntime().exec("/var/lib/hadoop-yarn/testscript.sh")
```
And this is how we finally got one step closer.
The result was:
- `id -Gn` is missing our `hadoop` group, which would have given us access to the file
- `id yarn` does include the group

While this was super confusing, it meant that we could finally stop digging at the low level: Java and Linux behaved as expected, so the issue had to be somewhere in how the process was set up.
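The asymmetry we saw amounts to comparing two views of group membership. A rough sketch of that comparison in Python (my own illustration, not code from YARN or Cloudera; the database enumeration here mirrors what we later found in supervisor):

```python
import grp, os, pwd

# Groups the *running process* actually carries:
proc_groups = set(os.getgroups())

# Groups the group-database enumeration says our user is a member of:
try:
    user = pwd.getpwuid(os.getuid()).pw_name
except KeyError:
    user = ""  # uid not in the passwd database at all
db_groups = {g.gr_gid for g in grp.getgrall() if user in g.gr_mem}

# If either set has entries the other lacks, the process was started
# with different supplementary groups than the database would suggest.
print("process-only groups:", sorted(proc_groups - db_groups))
print("db-only groups:", sorted(db_groups - proc_groups))
```

In our case the ResourceManager process was the one missing a group its user was supposed to have.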
All of this was in a Cloudera environment. Cloudera uses "Agents" implemented in Python to communicate with the Cloudera Manager Server. These Agents are the ones receiving commands to do things like start and stop processes. So we looked into those. I already knew that Cloudera uses Supervisor to actually manage and supervise the processes.
Looking into how supervisor starts the processes we found this:
```python
groups = [grprec[2] for grprec in grp.getgrall() if user in grprec[3]]
...
os.setgroups(groups)
```
Ha! Supervisor sets the supplementary groups for a process manually. Why, I don't know, but it does.
We used a Python console to check what `grp.getgrall()` returns and, drumroll, it is missing our `hadoop` group. Finally!
This made things easier, because the problem was now much simpler to reproduce.
So I looked in the CPython source code and found out how `getgrall()` is implemented. This is an extract:

```c
while ((p = getgrent()) != NULL) {
```
`getgrent` was something to google for, and this time it was enough to actually get a result and our solution. This Knowledge Base article from Red Hat states that `getgrent` does not return groups from LDAP when SSSD is used unless `enumerate` is turned on!
This was our problem! Our groups were defined in a central IPA instance and were not local, so they were not returned by `getgrent`. Our workaround is to create the groups locally instead of centrally, as we didn't want to take on the overhead of `enumerate`.
This was a fun day! I learned a lot.
If you have your groups in LDAP and retrieve them via SSSD, then Python's group enumeration won't see them, so supervisor won't see them, which leads to processes started by supervisor not getting them either. To work around this, create your groups locally or enable `enumerate` in SSSD.
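For completeness, enumeration is a per-domain switch in `/etc/sssd/sssd.conf` (a sketch; the domain name is a placeholder, and keep in mind Red Hat discourages enumeration on large directories for performance reasons):

```ini
[domain/example.com]
enumerate = true
```

SSSD needs a restart afterwards for the setting to take effect.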