1B - Электронные таблицы was extremely challenging for me; it took me over 2 hours to get AC. I didn't find the editorial helpful, and then when I went to see practice solutions from red/orange people, it made even less sense (they were basically copy/paste with minor variation):
I'll talk about each part of the problem itself and approaches/alternatives, and then explain how the above solution (henceforth "the solution") handles it. I'll go into unnecessary detail since I'm very new to CP/CF and would've wanted this level of detail myself.
Problem statement
Input: First line is an int representing the number of test cases. Each following line is a string of characters specifying a spreadsheet cell, in one of two formats:
- "Excel":
^[A-Z]+[0-9]+\$
(e.g.BB352
). - "RC":
^R[0-9]+C[0-9]+\$
(e.g.R253C55
).
Output: For test case, we must output the same cell specified in the opposite format. Example:
2
R23C55
BC23
should output
BC23
R23C55
Constraints:
- Between 1 and 10000 test cases (all valid in either format)
- In each case, the row and column numbers are no larger than 1000000 (so each line is 16 chars at longest since we never exceed
R1000000C1000000
/BDWGN1000000
).
Approach
Our implementation needs to do 3 things:
- Convert from Excel to RC
- Convert from RC to Excel
- Determine which format the input is in, and do the right conversion on it
The heart of this problem is a change of base; RC-format uses two base-10 numbers, whereas Excel uses one for rows and a quasi-base-26 character string for columns.
Determine format
Note that for RC format, we will always find a C
after some numbers in the string, whereas with Excel we will never find any letters after numbers. We could use this as a test of format with something like:
The solution does this in a better way:
The string %*c%d%*c%d
will match on the RC format, but not Excel; %*c
matches on and throws away any non-numeric characters, whereas %d
matches on numeric characters and stores them in the respective variable. For an RC string, R
and C
are thrown away but the numbers following them are stored in x
and y
(so (sscanf()
will return 2
). On an Excel string, this won't match so nothing is stored in x
or y
(and sscanf()
will return 0
).
Excel to RC conversion
This is the easier conversion, conceptually. We can copy the row portion and print it as-is. For the column portion, we have to convert from base-26 to base-10. A base-10 number represents addition of successive powers of 10, each times a constant: 12345
in base-10 means 5*(10^0) + 4*(10^1) + 3*(10^2) + 2*(10^3) + 1*(10^4)
. So for the column string, we can map each letter to a number (i.e. A=1
, B=2
, ... , Z=26
) and then associate each with a power of 26
. Consider RZA
: A=1
, Z=26
, and R=18
, so we can write it as 1*(26^0) + 26*(26^1) + 18*(26^2) = 12845
.
We have a problem: C++ has no built-in for integer exponentiation. There are a couple workarounds. We could try writing one ourselves:
or
The iterative version is slightly faster (by a few ten-thousandths of a second on my machine, probably due to the time involved with stack frame creation/cleanup) but we have bigger problems to worry about. Using unsigned ints means we can't compute negative powers, but those are ruled out by the problem statement so again we don't care.
Another approach would be to use a hashtable for powers of 26 since the problem constraints mean we'll never compute 26^n for n > 4 (though this requires more code):
However, the solution does none of these:
Let's take this apart, starting with for(x=0,p=s;*p>64;++p)
. After sscanf()
returns 0
, x
and y
are uninitialized, so x
is zeroed and will store the value of the column string as it's converted to base-10. We also set p
to the first character in the string. On each iteration, we increment p
(which will add 1 byte to the address) and then evaluate it.
Our termination condition is that the character p
points to is less than 64
; 65
is the ASCII code for A
, so during each iteration of our loop p
must point to a char that represents some letter from A-Z. The problem ensures each test case is a valid Excel or RC string, so while a test like *p>64 && *p < 91
would be more correct, it isn't necessary. When the loop ends, it will thus point to the first number in the string, i.e. the row portion of the string. We will then be able to print it directly, since it's already in base-10.
The next line is the most confusing: x=x*26+*p-64;
. First, what does it actually do? Let's add parens to make order of operations explicit:
- Dereference:
x*26+(*p)-64
- Multiplication:
(x*26)+(*p)-64
- Add / sub, with L-to-R associativity:
((x*26)+(*p))-64
The entire result is stored in x
repeatedly, so we can use substitution and write an entire loop as a single line. Using the previous example of converting RZA
to 12845
:
(((((((((0*26)+(82))-64)*26)+(90))-64)*26)+(65))-64)
then drop some redundant parens to get
((((0*26)+(82-64))*26+(90-64))*26+(65-64))
((((0)+(18))*26+(26))*26+(1))
((18*26+(26))*26+(1))
((468+26)*26)+1
12845
So we can see that this computes the right answer despite not using exponents, but why is it equivalent to our previous method using powers of 26? We can factor one into the other with a little high-school-level math:
((((0*26)+(82-64))*26+(90-64))*26+(65-64))
^------drop since it evaluates to 0
(((82-64)*26+(90-64))*26+(65-64))
^--- distribute over inner expression
(82-64)*26^2 + (90-64)*26 + (65-64)*26^0
^---26^0=1, so (65-64)*26^0=(65-64)
So we know x
will contain the now base-10 value of the column, and at the end of the loop p
points at the first number of the column portion (already base-10), so all we have to do is print them with R
and C
, and we're done with the test case:
RC to Excel conversion
Normally, we can convert from a lower base to a higher one via successive division until we get 0; the remainder after each step are the characters in the new base, starting with the least significant. For example, convert base-10 12345 to hexidecimal (which will be 0x3039):
12345 / 16 = 771 12345 % 16 = 9
771 / 16 = 48 771 % 16 = 3
48 / 16 = 3 48 % 16 = 0
3 / 16 = 0 3 % 16 = 3
But this won't work for our Excel format with base 26 because of two quirks you might have noticed by now:
- No
0
in Excel format - Off-by-one; Excel format treats A as the first symbol in the number system but uses A = 1. In decimal,
1
is the second character, after0
; so when n % 26 == 1 in our procedure, it really means "use the second character" but the format usesB
for the second character.
To make this explicit, we can try doing the normal conversion for decimal 494, but we don't get the expected result "RZ":
494 / 26 = 19 494 % 26 = 0 // ?; we have no letter for 0
19 / 26 = 0 19 % 26 = 19 // S; A = 1, B = 2, ... , S = 19
We can correct both issues with two small changes:
- If n % 26 == 0, emit "Z"
- Whenever we emit Z, decrement the quotient by 1 before doing the next division.
Now converting 494 works correctly:
494 / 26 = 19 494 % 26 = 0 // Z
19 - 1 = 18
18 / 26 = 0 18 % 26 = 18 // R; A = 1, B = 2, ... , R = 18
And other divisions do as well; decimal 1000000 converts to BDWGN:
1000000 / 26 = 38461 1000000 % 26 = 14 // N
38461 / 26 = 1479 38461 % 26 = 7 // G
1479 / 26 = 56 1479 % 26 = 23 // W
56 / 26 = 2 56 % 26 = 4 // D
2 / 26 = 0 2 % 26 = 2 // B
The "decrement by 1 if we emit Z" is a bit confusing, but consider this: in decimal addition when we overflow a place with a multiple of the current place, we emit 0 and carry to the next number. E.g. in 17+13, 3+7=10, so we emit 0 and carry the 1 to the 10s place so we get 30. So a sort of opposite happens here; instead of overflowing and wrapping to 0, we emit our 26th character and subtract 1 from the next place.
However, the solution once again does something different:
The helper function g()
handles converting the columns portion in y
and emits it immediately, the main loop prints the rows part stored in x
, which again requires no conversion. g()
itself requires a bit of explanation; here it is expanded:
Note that g()
does recursive division on t
, the base-10 column value; our base case / terminating condition is t == 0
(remember that this is the termination condition of our original change-of-base procedure). Recall that the original procedure obtains the characters in the new base in reverse order. g()
handles this by doing putchar()
after the recursive call returns; this means that the first subcall to print a character will be the one on top of the call stack, i.e. the most significant character.
The actual calculation here differs from repeated division discussed above. Instead of only decrementing the quotient when we emit Z, we decrement after every division and then again before calculating which letter to use. However, note that the letter choice begins from 65 ('A'). Let's convert 12845
, which results in RZA
. Does this handle our Z / remainder 0 problem?
(12845-1) / 26 = 494 (12845-1) % 26 = 0 // 65 + 0 = 65, 'A' in ASCII
(494-1) / 26 = 18 (494-1) % 26 = 25 // 65 + 25 = 90, 'Z' in ASCII
(18-1 / 26) = 0 (18-1) % 26 = 17 // 65 + 17 = 82, 'R' in ASCII
It does! And observe what happens for 1000000 (when we aren't dealing with Z / remainder 0):
1000000 / 26 = 38461 (1000000-1) % 26 = 13 // 65 + 13 = 78, N
(38461-1) / 26 = 1479 (38461-1) % 26 = 6 // 65 + 6 = 71, G
(1479-1) / 26 = 56 (1479-1) % 26 = 22 // 65 + 22 = 87, W
(56-1) / 26 = 2 (56-1) % 26 = 3 // 65 + 3 = 68, D
(2-1) / 26 = 0 (2-1) % 26 = 1 // 65 + 1 = 66, B
The actual integer quotient is unchanged, but all of the remainders are decremented. Thus adding the remainder to 65 instead of 64 is effectively undoing the subtraction, leaving us with the same procedure as before of simply dividing by 26 repeatedly. In essence, g()
implements a proper change-of-base procedure: A
is now the first letter in our alphabet, because a remainder of 0
now results in its character code 65
and Z
is still the last character.
And that's it! g()
will print the chars for the column in reverse order as described above, and then emit the rows portion immediately after, then go to the next test case.
Please comment / criticize / ask questions if you feel like this explanation can be improved in any way (I'll publish edits over time).