mirror of
https://github.com/Lissy93/twitter-sentiment-visualisation.git
synced 2021-05-12 19:52:18 +03:00
Uploaded results from the research surveys
BIN docs/dictionary-vs-machine-sa.png (new binary file, 118 KiB; not shown)
docs/references.md (new file, 96 lines)
# REFERENCES

Meesad, P. (2014). Stock trend prediction relying on text mining and sentiment analysis with tweets. Information and Communication Technologies (WICT), 2014 Fourth World Congress on. 11 (3), 257-262.

Mahmood, T.; Iqbal, T.; Amin, F.; Lohanna, W.; Mustafa, A. (Dec. 2013). Mining Twitter big data to predict 2013 Pakistan election winner. Multi Topic Conference (INMIC), International. 16 (5), 49-54.

Cheng, D.; Schretlen, P.; Kronenfeld, N.; Bozowsky, N.; Wright, W. (2013). Tile based visual analytics for Twitter big data exploratory analysis. Big Data, IEEE International Conference on. 8 (3), 2-4.

Wang, W.; Chen, L.; Thirunarayan, K.; Sheth, A.P. (2012). Harnessing Twitter "Big Data" for Automatic Emotion Identification. Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom). 587-592.

Rahnama, A.H.A. (2014). Distributed real-time sentiment analysis for big data social streams. Control, Decision and Information Technologies (CoDIT), 2014 International Conference on. 789-794.

Wickramaarachchi, C.; Kumbhare, A.; Frincu, M.; Chelmis, C.; Prasanna, V.K. (4-7 May 2015). Real-Time Analytics for Fast Evolving Social Graphs. Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on. 1 (1), 829-834.

Hamed, A.A.; Wu, X. (June-July 2014). Does Social Media Big Data Make the World Smaller? An Exploratory Analysis of Keyword-Hashtag Networks. Big Data (BigData Congress), 2014 IEEE International Congress on. 1 (1), 454-461.

Nguyen, V.D.; Varghese, B.; Barker, A. (6-9 Oct. 2013). The royal birth of 2013: Analysing and visualising public sentiment in the UK using Twitter. Big Data, 2013 IEEE International Conference on. 1 (1), 46-54.

Osman, A.H.; Salim, N. (26-28 Aug. 2013). An improved semantic plagiarism detection scheme based on Chi-squared automatic interaction detection. Computing, Electrical and Electronics Engineering (ICCEEE), 2013 International Conference on. 1 (1), 640-647.

Office of Information Services. (2008). Selecting a Development Approach. Centre for Technology in Government. 3-4.

Linus Torvalds. (2016). Git. Available: https://git-scm.com/. Last accessed 1 March 2016.

expressjs. (2014). body-parser. Available: https://github.com/expressjs/body-parser. Last accessed August 2015.

jashkenas. (2011). coffee-script. Available: https://github.com/jashkenas/coffeescript. Last accessed August 2015.

expressjs. (2015). cookie-parser. Available: https://github.com/expressjs/cookie-parser. Last accessed 12 October 2015.

visionmedia. (2015). debug. Available: https://github.com/visionmedia/debug. Last accessed 12 October 2015.

strongloop. (2015). express. Available: https://github.com/strongloop/express. Last accessed 12 October 2015.

jadejs. (2015). jade. Available: https://github.com/jadejs/jade. Last accessed 12 October 2015.

Automattic. (2015). mongoose. Available: https://github.com/Automattic/mongoose. Last accessed 12 October 2015.

expressjs. (2015). morgan. Available: https://github.com/expressjs/morgan. Last accessed 12 October 2015.

petehunt. (2015). node-jsx. Available: https://github.com/petehunt/node-jsx. Last accessed 12 October 2015.

facebook. (2015). react. Available: https://github.com/facebook/react. Last accessed 12 October 2015.

Socket.IO. (2015). Socket.IO. Available: http://socket.io/. Last accessed 12 October 2015.

Node.js Foundation. (2016). Node.js Website. Available: https://nodejs.org/en/. Last accessed February 2016.

Adrian Mejia. (2014). MEAN Stack - MongoDB ExpressJS AngularJS NodeJS (Part III). Available: http://adrianmejia.com/blog/2014/10/03/mean-stack-tutorial-mongodb-expressjs-angularjs-nodejs/. Last accessed 4 January 2016.

Kai Lei; Yining Ma; Zhi Tan. (19-21 Dec. 2014). Performance Comparison and Evaluation of Web Development Technologies in PHP, Python, and Node.js. Computational Science and Engineering (CSE), 2014 IEEE 17th International Conference on. 1 (2), 2-5.

Stefan Tilkov; Steve Vinoski. (Nov.-Dec. 2010). Node.js: Using JavaScript to Build High-Performance Network Programs. IEEE Internet Computing. 14 (6), 80-83.

Andres Ojamaa; Karl Düüna. (2014). Assessing the security of Node.js platform. Internet Technology and Secured Transactions, 2012 International Conference for. 2 (1), 348-355.

Xiao-Feng Gu; Le Yang; Shaoquan Wu. (2014). A real-time stream system based on node.js. Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2014 11th International Conference on. 11 (1), 479-482.

Andrew John Poulter; Steven J. Johnston; Simon J. Cox. (14-16 Dec. 2015). Using the MEAN stack to implement a RESTful service for an Internet of Things application. Internet of Things (WF-IoT), 2015 IEEE 2nd World Forum on. 280-285.

TJ Holowaychuk. (2016). Express. Available: http://expressjs.com/. Last accessed March 2016.

gotwarlost. (2014). Istanbul. Available: https://github.com/gotwarlost/istanbul. Last accessed 1 March 2016.

TJ Holowaychuk. (2011). Mocha JS. Available: https://mochajs.org/. Last accessed 1 March 2016.

The Chai Assertion Library. (2015). Chai JS. Available: http://chaijs.com/. Last accessed 1 March 2016.

Core Less Team. (2015). Less CSS. Available: http://lesscss.org/. Last accessed 1 March 2016.

CoffeeScript. (2015). CoffeeScript. Available: http://coffeescript.org/. Last accessed 1 March 2016.

gulp. (2015). gulp. Available: http://gulpjs.com/. Last accessed 1 March 2016.

NGINX Inc. (2015). NGINX. Available: https://www.nginx.com/. Last accessed 1 March 2016.

The CentOS Project. (2015). CentOS. Available: https://www.centos.org/. Last accessed 1 March 2016.

JetBrains. (2015). WebStorm. Available: https://www.jetbrains.com/webstorm/. Last accessed 1 March 2016.

GitHub. (2015). GitHub. Available: https://github.com/. Last accessed 1 March 2016.

git. (2015). Git. Available: https://git-scm.com/. Last accessed 1 March 2016.

Socket.IO contributors. (2015). Socket.io. Available: http://socket.io/. Last accessed 1 March 2016.

Node.js Foundation. (2015). Express.js. Available: http://expressjs.com/. Last accessed 1 March 2016.

Node.js Foundation. (2015). Node.js. Available: https://nodejs.org/en/. Last accessed 1 March 2016.

IBM. (2015). IBM Watson. Available: http://www.ibm.com/smarterplanet/us/en/ibmwatson/. Last accessed 1 March 2016.

HP. (2015). Haven OnDemand. Available: https://www.havenondemand.com/. Last accessed 1 March 2016.

Twitter. (2015). Twitter API. Available: https://dev.twitter.com/rest/public. Last accessed 1 March 2016.

Google. (2015). Google Places API. Available: https://developers.google.com/places/. Last accessed 1 March 2016.

docs/research-1-sa-current-uses.md (new file, 26 lines)
# SENTIMENT ANALYSIS:
> AN EXECUTIVE ASSESSMENT OF REPRESENTING TRENDS IN OPINIONS EXPRESSED ON SOCIAL MEDIA
## Abstract
The purpose of this literature review is to explore how sentiment analysis can be used to visually represent people’s opinions, and more specifically how these visual representations can illustrate trends of sentiment mapped against factors such as time, location or topic. It will also touch on the different methods of analysing sentiment, as this dictates what type of trends can be mapped.
## Introduction
Despite the vast amounts of data uploaded every day to public social media networks (500 million Tweets per day) and the advances in sentiment analysis and natural language understanding, research into applying sentiment analysis to social media data to find trends has been limited. The purpose of this review is to explore the potential of gaining an insight into overall attitudes towards specific topics by analysing social media channels, and how the trends in these attitudes can have practical applications.
## Related Work and Current Applications for Twitter Data
Twitter data can be used to show trends and then make predictions for the future based on historical events. Meesad (2014) outlines how trends from past events can be mapped to stock prices, and this, in turn, can be used to reasonably accurately predict future stock prices, aiding investors to make better trading decisions.
The social Web is also being commercially exploited for purposes such as automatically extracting customer opinions about products or brands (Bansal et al. 2004). This gleaning of Twitter data can then be used to gain a deeper understanding of people’s behaviour and actions (Wang et al. 2012). One key use for this insight into people’s opinions would be to aid marketing campaigns, as companies would have a better understanding of which techniques were effective in successfully marketing a product or service.
Mahmood et al. (2013) describe how, in the 2013 Pakistani elections, they were able to use a large set of Twitter data from Pakistan to accurately predict the winners of the election. The algorithm worked by applying a series of predictive rules to the data and categorising Tweets based on which rules they followed. This provides a valuable insight into the country’s political future before any official results have been released, and information like this could not be collected on this scale with any conventional data collection method.
## Comparing the different methods of sentiment analysis
There are several different ways of calculating the sentiment of a string of text, but two key underlying approaches. The first is the dictionary-based method, where a database of words holds a score for how positive or negative each word is; an algorithm then calculates the overall aggregated score for a given string based on the database. The second method uses natural language understanding and machine learning to return much more detailed and accurate results. However, this is not readily available without large amounts of computing power, nor is it easily possible on large sets of data.
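The dictionary-based method described above can be sketched in a few lines. This is a minimal illustration only, using a tiny hand-written word list in place of a full sentiment dictionary (the scores here are made up for the example):

```javascript
// Minimal dictionary-based sentiment scorer. `wordScores` is a tiny
// hand-written stand-in for a real annotated word list.
const wordScores = { good: 3, great: 3, happy: 3, bad: -3, terrible: -3, sad: -2 };

function sentimentScore(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  // Aggregate the valence of every word found in the dictionary;
  // unknown words contribute nothing to the score.
  return words.reduce((total, w) => total + (wordScores[w] || 0), 0);
}

console.log(sentimentScore("What a great, happy day")); // 6
console.log(sentimentScore("terrible, sad news"));      // -5
```

A real implementation differs only in the size of the word list and in how scores are normalised afterwards.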
Within the machine learning method, there are a number of different algorithms that can be used. Mahmood et al. (2013) show a good comparison between three of them on page 4 of their paper. The CHAID algorithm was slightly more accurate than Naïve Bayes and the Support Vector Machine, but the results were close. These algorithms form the basis of the classification model, which in turn dictates how the data is ordered; this directly affects the results, as the system learns as it goes along. This works in a similar way to the semantic text plagiarism detection technique outlined by Osman (2013), which looks at the semantic allocations of each sentence to gain an understanding of the underlying message.
Vu Dung Nguyen (2013) quantitatively compares both the dictionary-based and the machine learning approaches to sentiment analysis. In this paper, they were studying reactions to the royal birth of 2013.
A key difference between the efficiency of the two approaches is that the machine learning method requires each Tweet to be analysed individually, with the results then aggregated, whereas the dictionary-based approach can analyse all Tweets at once, as it simply assigns a score to each word. The following diagram illustrates this concept.

When comparing the results of dictionary-based methods and natural language methods, there are some differences. The dictionary-based results seem to be consistently lower (more negative) than the natural language results; however, the overall sentiment trend that is plotted is very consistent between the two methods. The difference in sentiment value is likely to be caused by there being significantly more negative words than positive words in the English language.
docs/research-2-sa-comparison.md (new file, 68 lines)
# SENTIMENT ANALYSIS: COMPARING TECHNICAL APPROACHES
## What is Sentiment Analysis?
Sentiment analysis is the process of computationally identifying opinions expressed in a piece of text, to determine the overall attitude conveyed. At its most basic level, this could be resolving the string into an integer score that represents positivity. It can, however, go a lot further, identifying keywords in the text and then computing what the author’s feelings and attitudes are towards that topic. The results are of course just subjective impressions and not facts, but with large sets of data they can build up a very accurate representation of people’s opinions.
There is a growing demand for SA to make sense of the large amounts of data representing people’s opinions. It can be used to understand attitudes conveyed in mass amounts of Twitter data, to analyse product reviews, or to categorise customer emails, to name just a few of its applications.
The purpose of this paper is to carry out quantitative and qualitative research comparing different readily-available methods of SA. The two most common SA methods are the dictionary-based model and the natural language understanding based model. As a benchmark, the results from both of these will also be compared with human-computed values, which are likely to be much more accurate, although considerably slower to compute.
## Dictionary-Based Sentiment Analysis
The lexicon-based approach involves calculating orientation for a document from the semantic orientation of words or phrases in the document (Turney 2002). This is usually done with a predefined dataset of words annotated with their semantic values, and a simple algorithm can then calculate an overall semantic score for a given string. Dictionaries for this approach can either be created manually (see also Stone et al. 1966; Tong 2001), or automatically, using seed words to expand the list of words (Hatzivassiloglou and McKeown 1997; Turney 2002; Turney and Littman 2003).
## Natural Language Understanding Approach
The natural language understanding (NLU) or text classification approach involves building classifiers from labelled instances of texts or sentences (Pang, Lee, and Vaithyanathan 2002), essentially a supervised classification task. There are various NLU algorithms; the two main branches are supervised and unsupervised machine learning. A supervised learning algorithm generally builds a classification model from a large annotated corpus. Its accuracy depends mainly on the quality of the annotation, and the training process will usually take a long time. The unsupervised approach uses a sentiment dictionary, rather like the lexicon-based approach, with the addition that it builds up a database of common phrases and their aggregated sentiment as well.
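As an illustration of the supervised branch, the sketch below trains a tiny Naive Bayes text classifier on a hand-labelled toy corpus. This is a minimal example of the technique, not the algorithm used by any particular SA engine, and the training sentences are invented for the example:

```javascript
// Tiny supervised Naive Bayes classifier over a toy labelled corpus.
function train(examples) {
  const m = { wordCounts: { pos: {}, neg: {} }, wordTotals: { pos: 0, neg: 0 },
              docCounts: { pos: 0, neg: 0 }, vocab: new Set() };
  for (const { text, label } of examples) {
    m.docCounts[label] += 1;
    for (const w of text.toLowerCase().split(/\s+/)) {
      m.wordCounts[label][w] = (m.wordCounts[label][w] || 0) + 1;
      m.wordTotals[label] += 1;
      m.vocab.add(w);
    }
  }
  return m;
}

function classify(m, text) {
  const totalDocs = m.docCounts.pos + m.docCounts.neg;
  let best = null, bestLogP = -Infinity;
  for (const label of ["pos", "neg"]) {
    // Start from the class prior, then add per-word log-likelihoods,
    // with Laplace smoothing so unseen words don't zero the probability.
    let logP = Math.log(m.docCounts[label] / totalDocs);
    for (const w of text.toLowerCase().split(/\s+/)) {
      const p = ((m.wordCounts[label][w] || 0) + 1) /
                (m.wordTotals[label] + m.vocab.size);
      logP += Math.log(p);
    }
    if (logP > bestLogP) { bestLogP = logP; best = label; }
  }
  return best;
}

const model = train([
  { text: "great product really love it", label: "pos" },
  { text: "happy with the fast service", label: "pos" },
  { text: "terrible quality very disappointed", label: "neg" },
  { text: "awful slow and broken", label: "neg" },
]);
console.log(classify(model, "love the great service")); // "pos"
console.log(classify(model, "awful broken product"));   // "neg"
```

A real system would train on thousands of annotated examples, which is why the annotation quality and training time dominate the approach's cost.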
## Research Plan Methodology
I am going to conduct a research experiment to compare the results produced by natural language understanding (NLU) SA methods with the dictionary-based SA approach. There will also be a set of results computed by humans, to be used as a benchmark.
### The natural language understanding component
The HP Haven OnDemand API is a powerful natural language understanding engine and will be used for the NLU component. It is free to use for a limited number of requests and provides more detail than just an aggregate sentiment score. I have developed a custom wrapper module for this experiment; the source code and documentation can be viewed at https://goo.gl/NTXfyp
### The dictionary-based component
I have developed a simple dictionary-based algorithm and packaged it up as a standalone module; this will be used for the dictionary-based component. For the dataset, it will make use of the AFINN-111 word list, which is a comprehensive list of English words annotated with an integer for valence. The source code and documentation can be viewed at https://goo.gl/gU4f9A
### The human component
To provide a basic benchmark for the results, a survey will be drawn up including a sample of the Tweets analysed by the two systems. Participants will be asked to rate each Tweet with a score between 0 (very negative) and 10 (very positive), with 5 being neutral. Since the survey is considerably more time-consuming than the other two methods, only a sample of the data will be analysed by the five participants.
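To make the survey scores directly comparable with the automated results, which are plotted on a -1 to +1 scale, a simple linear rescaling can be used. The helper below is hypothetical (not code from the project), but shows the mapping:

```javascript
// Hypothetical helper: maps a 0-10 survey rating (5 = neutral) onto the
// -1..+1 sentiment scale used by the automated methods.
function surveyToSentiment(rating) {
  if (rating < 0 || rating > 10) throw new RangeError("rating must be 0-10");
  return (rating - 5) / 5;
}

console.log(surveyToSentiment(0));  // -1 (very negative)
console.log(surveyToSentiment(5));  //  0 (neutral)
console.log(surveyToSentiment(10)); //  1 (very positive)
```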
### Data source
The data source will be Tweets regarding the Edward Snowden case in 2013. The Twitter API will be used to supply the Tweets. I have written and published a custom module to facilitate the easy fetching of relevant Tweets, see https://goo.gl/WQy7gI for source code and documentation.
### Rendering and Displaying Results
Since the results will be dynamic (able to change if a different query is passed in), the charts must be flexible. A combination of Google Charts JavaScript library and a custom module written in D3.js will be used.
> To view the final solution online, and check out the comparison tool for yourself, visit: [http://sentiment-sweep.com/sa-comparison](http://sentiment-sweep.com/sa-comparison)
> There are also links to the documentation from here, and a brief explanation of how it works.
## The Result

### Description of Graph
Each connected point (three or fewer dots, connected with a single vertical line) represents a Tweet. Each dot is a calculated sentiment analysis result: light blue is the dictionary-based result, mid-blue is the NLU-based result, and dark blue is the benchmark result calculated by humans in the survey. The lines between the points indicate that they were generated from the same Tweet. Some points do not have lines because the dictionary result was exactly the same as the NLU result. The x-axis shows Tweet length (i.e. string length, between 0 and 160). The y-axis is a measure of overall sentiment, between -1 and +1, where -1 is the lowest possible value and +1 is the highest.
### Generating the Graph
This graph was rendered using Google Charts and then dynamically modified with a script written in D3.js. Because this graph is dynamic, the user can enter any search term, and the system will fetch relevant Tweets, run the sentiment scripts on them and generate a similar-looking graph. This process is fully automated and typically takes 5-8 seconds.
### Results
There are several findings from this research.
Firstly, there is a clear relationship between the length of the input string (in this case a Tweet) and the accuracy of the results. The longer the input text, the closer together the NLU, dictionary and human results are, in most cases. This is because a better understanding of what is being conveyed can be grasped from longer sentences.
The dictionary-based results tended to produce more neutral values, whereas the NLU method was able to distinguish positivity and negativity in most Tweets. This is because it is able to interpret actual semantic meaning from the sentence, as opposed to just looking at the positivity of individual words.
The overall average sentiment produced by the NLU method for the Edward Snowden dataset was 0.028 (very close to neutral, as many very positive and very negative Tweets cancelled each other out), and the average for the dictionary-based approach was 0.019 (again very neutral). This difference of 0.009 shows that the two methods, despite being very different, produced results that were overall not far apart.
Finally, there were some cases (on other datasets), where the sentence was using very positive sounding words to convey a sarcastic message. In some cases, the NLU method was able to distinguish this, and gave an appropriate sentiment score, however, the dictionary-based approach failed miserably.
## Summary of comparison between different SA approaches

The radar chart above illustrates how the dictionary-based method compares with the NLU approach. The data was calculated using the custom-written dictionary approach mentioned above and the HP Haven NLU sentiment analysis engine.
The radar chart illustrates how although NLU SA is significantly more accurate and returns very detailed results, it is certainly not scalable for larger solutions, nor is it fast (hence not suitable for real-time data), and is not cheap to implement and maintain either.
In conclusion, although the natural language understanding approach is able to deliver more accurate results and distinguish a wider variety of emotions, it takes a lot longer to complete each request and requires considerably more computing power, both of which mean that it is less cost-effective and scalable for larger solutions.