
WSJ Web Crawler

Introduction

This is an introduction to the Wall Street News Counts project. The idea is to crawl the online financial news of publicly traded companies, do a sentiment analysis of the news, and link the sentiment score with the stock price trend. It is believed that the frequency of the news (volume/counts) and the sentiment analysis score are good predictors of the stock price trend (see 1).

The first step is to crawl the news from different web pages like the Wall Street Journal, Bloomberg and so on. Here the source is WSJ, where free users can only get the news titles but not the contents. The second step is to clean and reformat the data: encoding, date and time formats, exception handling. The third step is to score the contents of the news. A popular way is to score the data with different models and then take the median or the sum of the scores (is this like a random forest?). The fourth step is to incrementally dump the cleaned data to the database, making sure there are no duplicates. The fifth step is to back up the database and the tables after dumping; you will find backups become more and more important in your life. The sixth step is to create some summary information tables for data visualization. For example, here I don't want to hit the database every time someone visits my page, since that will most likely cause the database to crash. Instead, I will save the summary information in an independent table in memory; Redis is very helpful as a data cache. Finally, the data will be sent to the front end when it is requested.

Main program

1. crawl wsj web

I did not find that WSJ has a good API for downloading the data. (By the way, it is similar with Yahoo Finance and Google Finance, which no longer support downloading stock data in CSV very well; it used to be very easy.) But Python is very powerful for web crawling. scrapy is one of the most famous and powerful tools to use. Some other lightweight tools include BeautifulSoup, requests, urllib and so on. If you use these lightweight tools, you need to spend some time reading the source code of the web page. There are usually some nice features that make it easy to figure out how to write your crawler. Also, regular expressions will be your friend all the time.
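A minimal sketch with requests, BeautifulSoup and re is below. The search URL and the assumption that headlines sit in h3 tags are placeholders; you would verify both against the actual page source before using this.

```python
# Minimal crawler sketch with requests + BeautifulSoup.
# The URL pattern and the h3 selector are assumptions -- inspect the
# real WSJ page source to find the right selectors.
import re
import requests
from bs4 import BeautifulSoup

def crawl_wsj_headlines(ticker):
    # hypothetical search URL; the site may change its layout or block bots
    url = "https://www.wsj.com/search?query={}".format(ticker)
    headers = {"User-Agent": "Mozilla/5.0"}   # many sites reject the default UA
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    titles = []
    for tag in soup.find_all("h3"):           # assume headlines live in <h3> tags
        text = re.sub(r"\s+", " ", tag.get_text()).strip()
        if text:
            titles.append(text)
    return titles

if __name__ == "__main__":
    print(crawl_wsj_headlines("AAPL")[:5])
```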

2. clean data

This step includes cleaning the raw data (removing HTML tags, filling the missing parts and so on), encoding the characters (utf-8), and formatting the data (date and time formats, numbers). Some useful packages are datetime, time and re.
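Here is a small cleaning sketch, assuming each raw record is a dict with "title", "date" and "time" fields (the field names and formats are hypothetical).

```python
# Cleaning sketch: strip html tags, normalize whitespace, parse the date.
import re
from datetime import datetime

TAG_RE = re.compile(r"<[^>]+>")          # remove leftover html tags

def clean_record(raw):
    title = TAG_RE.sub("", raw.get("title", "")).strip()
    # normalize "June 5, 2017" + "8:30 am ET" style strings into one datetime
    date_str = raw.get("date", "").strip()
    time_str = raw.get("time", "12:00 PM").replace("ET", "").strip().upper()
    try:
        published = datetime.strptime(
            "{} {}".format(date_str, time_str), "%B %d, %Y %I:%M %p")
    except ValueError:
        published = None                  # keep the record, fix the date later
    return {"title": title, "published": published}

print(clean_record({"title": "<b>Apple rises</b>",
                    "date": "June 5, 2017", "time": "8:30 am ET"}))
```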

3. score the data

After getting the news, we need to do some transformation on the data. nltk can help with this work: tokenizing the words, normalizing the words, stemming, creating tf-idf features and calculating the distance between different documents. These will be the input for the sentiment model. Different models will be used to score the same contents to get more reliable results.
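A sketch of that pipeline is below, using nltk's tokenizer and stemmer plus scikit-learn's TfidfVectorizer, and VADER as one example sentiment scorer; any other model's score could be combined the same way.

```python
# Tokenize, stem, build tf-idf, and score sentiment with VADER as one model.
import nltk
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("vader_lexicon", quiet=True)

docs = ["Apple shares surge after strong earnings",
        "Regulators probe the company over privacy concerns"]

stemmer = PorterStemmer()
stemmed = [" ".join(stemmer.stem(w) for w in nltk.word_tokenize(d.lower()))
           for d in docs]

tfidf = TfidfVectorizer().fit_transform(stemmed)   # document-term matrix
print(tfidf.shape)

sia = SentimentIntensityAnalyzer()
scores = [sia.polarity_scores(d)["compound"] for d in docs]
print(scores)                                      # one score per headline
```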

4. dump data to db

If you only run the job once, a database might not be necessary. But if you want the job to be kicked off regularly, dumping the data to a database is very important. Please refer to this for a quick review. Since the news here can be pretty long, MySQL TEXT might not work depending on your MySQL version. MongoDB, which is document oriented, together with Redis as a cache will work better.

The crawling code will be run every day to download the latest news, so only the incremental data will be saved to the database. Too frequent reading and writing will cause the database to crash. Here is what I did: 1) download the news for each ticker and save it in a temp table; 2) select the latest time for that ticker, and only dump the downloaded news after that time to the database. This works like appending a table to the existing table rather than inserting each single record, as in the sketch below.
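The sketch assumes a MySQL table news(ticker, pub_time, title, score) and a pandas DataFrame of freshly crawled rows; the table and column names are hypothetical.

```python
# Incremental dump: keep only rows newer than the latest stored time,
# then append them as one block instead of inserting row by row.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/wsj")

def dump_incremental(new_df, ticker):
    with engine.connect() as conn:
        last = conn.execute(
            text("SELECT MAX(pub_time) FROM news WHERE ticker = :t"),
            {"t": ticker}).scalar()
    if last is not None:
        new_df = new_df[new_df["pub_time"] > last]   # keep only newer rows
    if not new_df.empty:
        new_df.to_sql("news", engine, if_exists="append", index=False)
    return len(new_df)
```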

5. backup the data

Again, for a one-time run this is not necessary. But if you want to run the job regularly, it is very important to back up the data.
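One simple way, sketched below, is to call mysqldump from Python and write a timestamped file; the database name and user are placeholders, and the password is expected to come from ~/.my.cnf rather than the command line.

```python
# Backup sketch: dump the database to a timestamped .sql file.
import subprocess
from datetime import datetime

def backup_db(db="wsj", user="user"):
    outfile = "backup_{}_{:%Y%m%d}.sql".format(db, datetime.now())
    with open(outfile, "w") as f:
        # credentials are read from ~/.my.cnf ([client] section)
        subprocess.run(["mysqldump", "-u", user, db], stdout=f, check=True)
    return outfile
```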

6. create summary information

It is not a good idea to read all the data in the database for every request. Here I will only show the weekly number of news items for each ticker and their stock price. A workable solution is to read the data out of the database once and then aggregate it into weekly information. Any request from the front end will then call this summary data without having to read all the information in the database. This not only saves time, but also reduces the possibility of a system breakdown.
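Below is a sketch of building the weekly summary and caching it in Redis so the front end never hits MySQL directly; the Redis key name and the news table are assumptions carried over from the previous step.

```python
# Build weekly news counts per ticker and cache the result in Redis.
import json
import pandas as pd
import redis
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/wsj")
cache = redis.Redis(host="localhost", port=6379, db=0)

def refresh_weekly_summary():
    df = pd.read_sql("SELECT ticker, pub_time FROM news", engine,
                     parse_dates=["pub_time"])
    weekly = (df.set_index("pub_time")
                .groupby("ticker")
                .resample("W")            # one row per ticker per week
                .size()
                .rename("n_news")
                .reset_index())
    # store the ready-to-serve summary as JSON in Redis
    cache.set("weekly_news_counts",
              weekly.to_json(orient="records", date_format="iso"))

def get_weekly_summary():
    raw = cache.get("weekly_news_counts")
    return json.loads(raw) if raw else []
```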

7. server end

The server will call the generated summary data and then render it to the front end when the url triggers an event.
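A minimal Flask sketch of this is below; the route and the Redis key "weekly_news_counts" are the same assumptions used in the previous step.

```python
# Serve the cached weekly summary as JSON when the url is hit.
import json
import redis
from flask import Flask, jsonify

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379, db=0)

@app.route("/api/weekly_news_counts")
def weekly_news_counts():
    raw = cache.get("weekly_news_counts")   # serve from cache, not from MySQL
    return jsonify(json.loads(raw) if raw else [])

if __name__ == "__main__":
    app.run(port=5000)
```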

8. front end

If plotting in Python, matplotlib would be the number one choice for static graphs. For dynamic graphs, I have not figured out which is best; bokeh and plotly are two that are easy to learn and powerful to use. But for charts in the front end, highcharts is a good choice. The St. Louis Federal Reserve has some good examples of data visualization with highcharts. A lot of work still needs to be done here to improve the graphs. It is said you should know one back end language and one front end language.
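For the static case, a quick matplotlib sketch of the kind of chart used here is below: weekly news counts as bars on one axis and the closing price as a line on a twin axis. The data is dummy data, only for illustration.

```python
# Static plot sketch: weekly news counts (bars) vs. closing price (line).
import matplotlib.pyplot as plt
import pandas as pd

weeks = pd.date_range("2017-01-01", periods=8, freq="W")
counts = [3, 5, 2, 7, 4, 6, 8, 5]                 # weekly news counts (dummy)
price = [120, 122, 121, 125, 127, 126, 130, 129]  # weekly close (dummy)

fig, ax1 = plt.subplots()
ax1.bar(weeks, counts, width=5, color="lightgray", label="news count")
ax1.set_ylabel("news count")
ax2 = ax1.twinx()
ax2.plot(weeks, price, color="steelblue", label="close price")
ax2.set_ylabel("close price")
ax1.set_title("Weekly news counts vs. stock price")
plt.show()
```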

Others

1. cron job

Set up a cron job to run the program every Sunday. 0 8 * * 0 python /pathToCode/wsj.crawl.py will run the code every Sunday at 8:00 AM.

2. sync data

The downloading host and the web host are different, so after the data is downloaded, it must be synced to the web host to be displayed.
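One way is to push the generated data from the downloading host to the web host with rsync over ssh, as in the sketch below; the host name and paths are placeholders.

```python
# Sync sketch: push local data to the web host with rsync over ssh.
import subprocess

def sync_to_web_host(local_path="/pathToData/",
                     remote="user@webhost:/var/www/data/"):
    subprocess.run(["rsync", "-avz", local_path, remote], check=True)
```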

One last thing: once you step into the front end, it is a bottomless sea. From HTML to JavaScript, CSS, jQuery, node.js, vue.js, react and so on, you can only cram each of these at the last minute and learn just the surface.