Lin Tan
Our research on software text analytics has made it to press: Engineering Dimensions (Page 42), The Record (Pages 46 & 47), Marketwired, UWaterloo News

/* Software Text Analytics */

Motivation: Software reliability and security are critically important. Software bugs and vulnerabilities greatly hurt software reliability.

A fundamental challenge in detecting or preventing software bugs and vulnerabilities is knowing programmers' intentions, formally called specifications. If we know the specification of a program (e.g., where a lock is needed, what input a deep learning model expects, etc.), a bug detection tool can check whether the code matches the specification.

Software text, including code comments, API documentation, and user manuals, contains a rich amount of semantic information. Software text can provide a great data source for obtaining programs' correctness information, discovering important problems, and understanding programmers' needs.

What we have done in this direction: We proposed and conducted the first studies to leverage code comments to automatically detect software bugs and bad comments. We achieve these goals by combining techniques from different areas, including natural language processing (NLP), machine learning, information retrieval, program analysis and statistics. We have analyzed various forms of software text using different techniques to address various real-world problems.

(1) cComment: Understanding comments and the potential of utilizing comments. We conducted a comprehensive study of comment characteristics on six large software projects (Linux, FreeBSD, OpenSolaris, MySQL, Firefox, and Eclipse), which span different types of software (OS, server, and desktop application) and different programming languages (C, C++, and Java). By studying comments written by programmers, we learned the real needs of programmers, which can (1) motivate the design of new techniques or improve the usability of existing tools for improving software reliability, and (2) help developers identify pervasive and important problems and adopt existing tools or languages for help. Among our findings, at least 52.6 ± 2.9% of the comments could be leveraged by existing or to-be-proposed tools for improving reliability.

(2) iComment: Using comments to detect software bugs and bad comments. A mismatch between a comment and the code indicates either (1) a bug: the source code does not follow the correct comment, or (2) a bad comment: the comment is wrong or outdated, which can later lead to bugs. iComment takes the first step toward detecting such comment-code inconsistencies by automatically extracting specifications from comments, and then using flow-sensitive and context-sensitive static program analysis tools to check these specifications against source code. iComment has found 60 previously unknown bugs and bad comments in large software (Linux, Mozilla, Apache, and Wine), and many of them have already been confirmed and fixed by the corresponding developers.
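To make the idea of a comment-code inconsistency concrete, here is a toy Python sketch. It is not iComment's implementation: the comment pattern, the `lock`/`unlock` calls, and the C-like sample are all hypothetical, and the check is a simple line-by-line scan rather than the flow-sensitive and context-sensitive analysis described above.

```python
import re

# Hypothetical rule pattern: a comment of the form
# "/* Caller must hold <lock> */" placed directly above a function definition.
RULE = re.compile(
    r"/\*\s*caller must hold (\w+)\s*\*/\s*\n\s*\w+\s+(\w+)\s*\(",
    re.IGNORECASE,
)
# A call statement such as "lock(queue_lock);" or "reset_queue();".
CALL = re.compile(r"^\s*(\w+)\((\w*)\);")


def extract_rules(source):
    """Map each function name to the lock its preceding comment requires."""
    return {func: lock for lock, func in RULE.findall(source)}


def find_mismatches(source, rules):
    """Report calls to a rule-bearing function made while its lock is not held."""
    held, mismatches = set(), []
    for lineno, line in enumerate(source.splitlines(), start=1):
        call = CALL.match(line)
        if not call:
            continue
        name, arg = call.groups()
        if name == "lock":
            held.add(arg)
        elif name == "unlock":
            held.discard(arg)
        elif name in rules and rules[name] not in held:
            mismatches.append((lineno, name, rules[name]))
    return mismatches


# Made-up sample input: one call site follows the comment, one does not.
SAMPLE = """\
/* Caller must hold queue_lock */
void reset_queue(void)
{
}

void good_path(void)
{
    lock(queue_lock);
    reset_queue();
    unlock(queue_lock);
}

void bad_path(void)
{
    reset_queue();
}
"""

if __name__ == "__main__":
    for lineno, func, lock in find_mismatches(SAMPLE, extract_rules(SAMPLE)):
        print(f"line {lineno}: {func}() called without holding {lock}")
```

The reported call site is either a bug (the lock really is required) or evidence of a bad comment (the requirement is wrong or outdated); deciding which still requires a developer.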

(3) aComment: Mining Annotations from Comments and Code to Detect Interrupt-Related Concurrency Bugs. To detect OS concurrency bugs, we proposed a new type of annotation, interrupt-related annotations, and automatically generated 96,821 such annotations for the Linux kernel with little manual effort. These annotations have been used to automatically detect 9 real OS concurrency bugs (7 of which were previously unknown). A key technique is a hybrid approach that extracts annotations from both code and comments written in natural language to achieve better coverage and accuracy in annotation extraction and bug detection.
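The hybrid idea can be illustrated with a small Python sketch. This is an assumption-laden toy, not aComment's algorithm or annotation format: the comment pattern, the `"disabled"` label, the function table, and the propagation rule (a caller inherits a callee's requirement unless it calls `local_irq_disable` itself) are all chosen here purely for illustration.

```python
import re

# Hypothetical comment pattern stating an interrupt precondition.
COMMENT_HINT = re.compile(
    r"interrupts? (?:must be |are )?(disabled|enabled)", re.IGNORECASE
)


def from_comment(comment):
    """Return 'disabled' or 'enabled' if the comment states a precondition."""
    match = COMMENT_HINT.search(comment)
    return match.group(1).lower() if match else None


def infer_annotations(functions):
    """functions maps a name to {"comment": str, "calls": [callee names]}."""
    # Step 1: seed annotations from natural-language comments.
    annotations = {}
    for name, info in functions.items():
        state = from_comment(info["comment"])
        if state:
            annotations[name] = state

    # Step 2: extend coverage from code. Assumed rule: a caller that invokes
    # a callee requiring disabled interrupts, and does not disable them
    # itself, inherits the same requirement. Repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for name, info in functions.items():
            if name in annotations or "local_irq_disable" in info["calls"]:
                continue
            if any(annotations.get(c) == "disabled" for c in info["calls"]):
                annotations[name] = "disabled"
                changed = True
    return annotations


# Made-up function table for illustration.
FUNCTIONS = {
    "update_stats": {
        "comment": "Must be called with interrupts disabled.",
        "calls": [],
    },
    "flush_stats": {"comment": "", "calls": ["update_stats"]},
    "safe_flush": {
        "comment": "",
        "calls": ["local_irq_disable", "flush_stats", "local_irq_enable"],
    },
}

if __name__ == "__main__":
    for name, state in infer_annotations(FUNCTIONS).items():
        print(f"{name}: expects interrupts {state}")
```

In this toy, only `update_stats` is annotated from its comment. `flush_stats` has no comment and is annotated from code alone, which is the coverage gain the hybrid approach aims for.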

(4) @tComment: Testing Javadoc Comments to Detect Comment-Code Inconsistencies

(5) New text and program analysis for new purposes including guiding symbolic execution to test software, extracting web API specifications, generating code from API documentation, inferring synonyms, ...


Related Publications:
ISSTA-22

DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions. Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey. In the proceedings of ACM SIGSOFT International Symposium on Software Testing and Analysis. July 2022. Virtual.

FSE-20

C2S: Translating Natural Language Comments to Formal Program Specifications. Juan Zhai, Yu Shi, Minxue Pan, Guian Zhou, Yongxiang Liu, Chunrong Fang, Shiqing Ma, Lin Tan, and Xiangyu Zhang. In the proceedings of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020), November, 2020. Sacramento, California, United States. Acceptance Rate: 28% (101/360)

ICSE-20

CPC: Automatically Classifying and Propagating Natural Language Comments via Program Analysis. Juan Zhai, Xiangzhe Xu, Yu Shi, Guanhong Tao, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. In the proceedings of the International Conference on Software Engineering. Seoul, South Korea. Acceptance Rate: 21% (129/617)

MSR-18

Towards Extracting Web API Specifications from Documentation. Jinqiu Yang, Erik Wittern, Annie T.T. Ying, Julian Dolby, and Lin Tan. In the proceedings of the Working Conference on Mining Software Repositories. Acceptance Rate: 33% (37/113) Won ACM SIGSOFT Distinguished Paper Award!

ICSE-16

Automatic Model Generation from Documentation for Java API Functions. Juan Zhai, Jianjun Huang, Shiqing Ma, Xiangyu Zhang, Lin Tan, Jianhua Zhao, and Feng Qin. In the proceedings of the International Conference on Software Engineering. Acceptance Rate: 19% (101/530)

FSE-16

Detecting Sensitive Data Disclosure via Bi-directional Text Correlation Analysis. Jianjun Huang, Xiangyu Zhang and Lin Tan. In the proceedings of the ACM SIGSOFT International Symposium on the Foundations of Software Engineering. Acceptance Rate: 27% (74/273) Won ACM SIGSOFT Distinguished Paper Award!

ICSE-15

DASE: Document-Assisted Symbolic Execution for Improving Automated Software Testing. Edmund Wong, Lei Zhang, Song Wang, Taiyue Liu and Lin Tan. In the proceedings of the International Conference on Software Engineering. Acceptance Rate: 18.5% (84/452)

ICSE-14

AsDroid: Detecting Stealthy Behaviors in Android Applications by User Interface and Program Behavior Contradiction. Jianjun Huang, Xiangyu Zhang, Lin Tan, Peng Wang, and Bin Liang. In the proceedings of the International Conference on Software Engineering. May-June, 2014. Hyderabad, India. (11 pages) Acceptance Rate: 20% (99/495) [BIBTEX]

LCTES-14

em-SPADE: A Compiler Extension for Checking Rules Extracted from Processor Specifications. Sandeep Chaudhary, Sebastian Fischmeister, and Lin Tan. In the proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems. June, 2014. Edinburgh, UK. (10 pages)

EMSE-14
(Journal)

SWordNet: Inferring Semantically Related Words from Software Context. Jinqiu Yang and Lin Tan. Accepted to the Springer Empirical Software Engineering. (28 pages) [DOI] [BIBTEX] [Data]

ICST-13

R2Fix: Automatically Generating Bug Fixes from Bug Reports. Chen Liu, Jinqiu Yang, Lin Tan, and Munawar Hafiz. In the proceedings of the International Conference on Software Testing, Verification and Validation. March, 2013. Luxembourg. (10 pages) Acceptance Rate: 25% (38/152) [BIBTEX]

ICST-12

@tComment: Testing Javadoc Comments to Detect Comment-Code Inconsistencies. Shin Hwei Tan, Darko Marinov, Lin Tan and Gary T. Leavens. In the proceedings of the 5th International Conference on Software Testing, Verification and Validation. April, 2012. Montreal, Quebec. (10 pages) Acceptance Rate: 26.9% (39/145). [Slides in PDF] [BIBTEX]

MSR-12

Inferring Semantically Related Words from Software Context. Jinqiu Yang and Lin Tan. In the proceedings of the Working Conference on Mining Software Repositories. June, 2012. Zurich, Switzerland. (10 pages) Acceptance Rate: 28.1% (18/64).

ICSE-11

aComment: Mining Annotations from Comments and Code to Detect Interrupt-Related Concurrency Bugs. Lin Tan, Yuanyuan Zhou and Yoann Padioleau. In the proceedings of the International Conference on Software Engineering. May, 2011. Waikiki, Honolulu, Hawaii. (10 pages) Acceptance Rate: 14.1% (62/441). [Slides (no animation)] [Slides (with animation)] [BIBTEX] (Press Coverage)

ICSE-09

Listening to Programmers - Taxonomies and Characteristics of Comments in Operating System Code. (Alphabetic order) Yoann Padioleau, Lin Tan and Yuanyuan Zhou. In the proceedings of the International Conference on Software Engineering. May, 2009. Vancouver, BC. (11 pages) Acceptance Rate: 12.3% (50/405). [PS] [Slides in PDF] [BIBTEX] [Data & Software] (Press Coverage)

SOSP-07

/* iComment: Bugs or Bad Comments? */ Lin Tan, Ding Yuan, Gopal Krishna and Yuanyuan Zhou. In the Proceedings of the 21st ACM Symposium on Operating Systems Principles, October 2007. Stevenson, Washington. (14 pages) Acceptance Rate: 19.1% (25/131). [PS] [Slides in PDF] [Slides in PDF with NO animation] [BIBTEX] [In other people's words]. (Press Coverage)

HotOS-07

HotComments: How to Make Program Comments More Useful? Lin Tan, Ding Yuan and Yuanyuan Zhou. In the Proceedings of the 11th Workshop on Hot Topics in Operating Systems, May 2007. San Diego, California. (6 pages) Acceptance Rate: 20.0% (21/105). [BIBTEX]


More Related Publications:
ICSE-23

Revisiting Learning-based Commit Message Generation. Jinhao Dong, Yiling Lou, Dan Hao, and Lin Tan. In the proceedings of the International Conference on Software Engineering. May 2023. Melbourne, Australia. Acceptance Rate: 26% (208/796)

ICSE-19
(SEIP)

Towards Better Utilizing Static Application Security Testing. Jinqiu Yang, Lin Tan, John Peyton, and Kristofer A Duer. In the proceedings of the International Conference on Software Engineering, Software Engineering In Practice. Acceptance Rate: 25% (30/118)

SANER-15

CloCom: Mining Existing Source Code for Automatic Comment Generation. Edmund Wong, Taiyue Liu and Lin Tan. In the proceedings of the IEEE International Conference on Software Analysis, Evolution, and Reengineering. (10 pages) Acceptance Rate: 31.9% (46/144) [Code & Data]

ASE-13

AutoComment: Mining Question and Answer Sites for Automatic Comment Generation. Edmund Wong, Jinqiu Yang, and Lin Tan. In the proceedings of the IEEE/ACM International Conference on Automated Software Engineering, New Idea Papers. (6 pages) Acceptance Rate: 23% (74/317) [BIBTEX] [Data]