Project Development

Sunday, October 6, 2013

How to parse Comma-Separated Values (CSV) file content with Regex in Java

A CSV file is a plain-text file in which the values of each record are separated by commas. It is widely used in many fields, e.g. the Google Contacts export format. It's a really good tool if your data is organized in columns like a table. It might look like:

Name     Address    Cell #
John     12333      123-456-7890
Peter    13444      234-567-8910

So the information is very well formatted, just like the data in an Excel table. It is therefore a good idea to export this data to a CSV file and then use the CSV-formatted data in your program.

KEY: The most important reason to use a CSV file: it is well formatted and easy to use.

Now, for programmers like me, how can we parse it?
The data in a CSV file is like a table: it is stored as rows whose values are separated by commas, as the name suggests.
So the intuitive way to handle the data is to read the file line by line, then split each line on commas.
The code will look like:
String s = in.readLine();
String[] str = s.split(",");

But if you think this is done, you are 50% wrong, since a field may itself contain a comma (such fields are wrapped in double quotes). So it's wise to use a regular expression that splits on a comma only when that comma is not part of a quoted string.

The regular expression I use is:

s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
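To make the difference concrete, here is a minimal sketch that runs the plain split and the regex split on the same line. The class name, the helper splitLine, the sample address and the -1 limit (which keeps trailing empty fields) are my own assumptions for illustration; only the regex itself comes from the post.

```java
import java.util.Arrays;

public class CsvSplitDemo {
    // Matches a comma only when the rest of the line contains an even
    // number of double quotes, i.e. the comma is outside a quoted field.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one CSV line into its fields.
    static String[] splitLine(String line) {
        // The -1 limit is an assumption: it keeps trailing empty fields.
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        String line = "John,\"123 Main St, Apt 4\",123-456-7890";

        // Plain split cuts the quoted address in two.
        System.out.println(Arrays.toString(line.split(",")));

        // Regex split keeps the quoted address as one field.
        System.out.println(Arrays.toString(splitLine(line)));
    }
}
```

Note that the surrounding double quotes stay in the field, so you may still want to strip them afterwards. The regex also assumes that quotes are balanced within one line; fields containing line breaks or unbalanced quotes are not handled.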
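The lookahead matches a comma only when it is followed by an even number of double quotes, that is, when the comma lies outside a quoted field.
Hope this helps you solve the problem.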

Happy Coding.

Wednesday, October 2, 2013

Reading and Writing Chinese Characters with Java

Today I found a page on a website with a tremendous amount of information I needed, so I fetched it by viewing its HTML source and then extracted the information from it.

Obviously, the first step in this task is being able to read the source, and then to write the extracted information to the target file. Since the page is mainly aimed at Chinese-speaking readers, most of it is written in Chinese, in an encoding I had barely paid attention to.

I first read the source file, simply using a BufferedReader with a FileReader, then printed it with the system print function. What I got was a few English characters mixed with question marks and other symbols. After a few Google searches, I learned from other programmers that the default output encoding in Eclipse (the IDE I mostly use for Java) is "MacRoman", not the "UTF-8" I expected. But I still got the same weird symbols (some of which I had never seen in my life) even after changing the output encoding via "run-configurations -> Common -> Encoding". I then realized that the way I read the Chinese characters from the file or console was already producing those messy symbols, because FileReader decodes with the default encoding instead of one I specify. So I started to use:

   BufferedReader in = new BufferedReader(new InputStreamReader(
    new FileInputStream("input.html"), "UTF-8"));

Problem solved.
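For completeness, the reader above can be sketched as a small runnable program. The class name Utf8ReadDemo and the helper readAll are my own names for illustration, and main first writes a tiny sample file so the sketch runs on its own; the string literal uses Unicode escapes so the source file itself stays ASCII-only.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ReadDemo {
    // Hypothetical helper: reads a whole file as UTF-8 text,
    // ending every line with '\n'.
    static String readAll(String path) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Create a small UTF-8 sample so the sketch runs on its own.
        Path sample = Files.createTempFile("input", ".html");
        Files.write(sample,
                "<p>\u4f60\u597d</p>\n".getBytes(StandardCharsets.UTF_8));

        // Decoding with the wrong charset would yield more characters here,
        // because each Chinese character takes three bytes in UTF-8.
        System.out.println(readAll(sample.toString()).length() + " characters read");
        Files.delete(sample);
    }
}
```

Passing the charset explicitly means the result no longer depends on the machine's default encoding.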

Following the same idea, I chose PrintStream to write the extracted information to a file.

        PrintStream out = new PrintStream("output.html", "UTF-8");
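As a minimal sketch of the writing side, the program below wraps that PrintStream in a helper and closes it properly. The class name Utf8WriteDemo and the helper writeLines are hypothetical, and main writes to a temporary file rather than output.html so that it can be run anywhere.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriteDemo {
    // Hypothetical helper: writes each string as one UTF-8 encoded line.
    static void writeLines(String path, String... lines) throws IOException {
        // try-with-resources flushes and closes the stream.
        try (PrintStream out = new PrintStream(path, "UTF-8")) {
            for (String line : lines) {
                out.println(line);
            }
            // PrintStream swallows I/O errors, so check for them explicitly.
            if (out.checkError()) {
                throw new IOException("failed to write " + path);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("output", ".html");

        // Unicode escapes keep this source file ASCII-only.
        writeLines(target.toString(), "<p>\u4f60\u597d</p>");

        System.out.println(Files.readAllBytes(target).length + " bytes written");
        Files.delete(target);
    }
}
```

Each Chinese character takes three bytes in UTF-8, so the byte count is larger than the character count.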

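Note that a PrintWriter built on a plain FileWriter doesn't work for this purpose, since it falls back to the default encoding; like PrintStream, PrintWriter has to be given the encoding explicitly (it also accepts a charset name next to the file name).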
Key Points:

  • If something weird is printed, suspect the encoding.
  • Choosing an API that lets you specify the encoding makes this problem much easier to handle.