InFrame: Hide Data Bits In Video Frames

-- Achieve Simultaneous Screen-Human Viewing and Hidden Screen-Camera Communication

InFrame aims to build a novel system for screen-camera communication that runs alongside ordinary screen-to-eye video viewing. It delivers primary video content to users and additional information to devices at the same time, over screen-to-camera visual links, without impairing the viewing experience.


Figure 1. The concept of InFrame.

Figure 1 illustrates the concept of InFrame. Composite frames are produced for the display by multiplexing the video content frames (intended for human viewers) with the data frames (intended for devices). These composite frames can be rendered to human eyes without affecting the viewing experience, so the user watches the video as usual without sensing the embedded data frames. Meanwhile, a camera can capture the composite frames and decode them to retrieve the embedded side information.

Technical Details

InFrame pursues two concrete goals. First, the data stream carried over the screen-to-camera channel should not affect the user's perception of the content delivered over the primary channel. Second, we seek a high data rate over the secondary screen-to-camera channel, despite the heavy interference from the primary video source on the screen-to-eye channel.

Hide Data Over Primary Screen-to-Eye Channel

The core of InFrame is to leverage the capability discrepancy between the human vision system and devices (display and camera), together with their distinctive features. For example, screens can display content faster than human eyes can perceive, and cameras have a shutter whereas human eyes do not. Video and data carried by the composite frames thus operate at different time scales. Video content is perceived at a slow pace because of the physical limits of human eyes (which act like a low-pass filter at 40-50 Hz), whereas data frames are displayed at a fast rate (here, 120 frames per second) that only the camera can capture and decode. InFrame also takes into account further characteristics of human visual perception: temporal resolution, spatial resolution and motion sensitivity.

To make full use of this perception gap, we first propose the novel concept of spatial-temporal complementary frames (STCFs). STCFs fully exploit the spatial and temporal low-pass filtering properties of the human vision system. Each STCF contains a pair of data frames that have complementary contents and are displayed back-to-back. Every data frame is further constructed from spatially complementary visual patterns made of alternating complementary cells (i.e., groups of pixels). When data frames are multiplexed onto the original screen content and displayed at a fast frame rate, the visibility of the data frames is effectively suppressed and the normal viewing experience is preserved.

How is an STCF constructed? Let us start with complementary pixels. Two pixels p and p* complement each other with respect to the luminance level v if their pixel values sum to 2v, i.e., v_p + v_p* = 2v. Note that in an actual data frame the average luminance level is set to zero (v = 0), so v_p* = -v_p. We then construct temporal complementary frames: a pair of data frames D and D* in which every pixel of D and the pixel at the same position in D* are complementary with respect to the luminance level v. We further develop spatial complementary patterns. Each pattern consists of spatially alternating complementary pixels (i.e., neighboring pixels are always complementary to each other), which produces a spatial smoothing effect. Finally, we combine spatially complementary patterns and temporally complementary frames into STCFs.
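
The construction above can be sketched in a few lines. This is a minimal illustration rather than the InFrame encoder: the function names, the 4x4 frame size and the amplitude of 3 are hypothetical, and the checkerboard is just one simple pattern in which neighboring pixels are complementary.

```python
def spatial_pattern(rows, cols, amplitude):
    """Checkerboard of +amplitude / -amplitude (hypothetical pattern):
    every pixel is the complement of its horizontal and vertical neighbours."""
    return [[amplitude if (r + c) % 2 == 0 else -amplitude
             for c in range(cols)]
            for r in range(rows)]


def temporal_complement(frame, v=0):
    """Return D* such that D[r][c] + D*[r][c] == 2 * v for every pixel."""
    return [[2 * v - p for p in row] for row in frame]


# Data frame D and its temporal complement D* (v = 0, so D* = -D).
D = spatial_pattern(4, 4, amplitude=3)
D_star = temporal_complement(D)

# Temporal complementarity: pixel values at the same position sum to 2v = 0.
pair_sums = {D[r][c] + D_star[r][c] for r in range(4) for c in range(4)}

# Spatial complementarity: horizontally adjacent pixels also sum to 0.
neighbour_sums = {D[r][c] + D[r][c + 1] for r in range(4) for c in range(3)}

print(pair_sums, neighbour_sums)  # {0} {0}
```

Both sets collapse to the single value 0, which is the property that lets temporal and spatial averaging cancel the data pattern.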

What is the perceptual effect of STCFs? When STCFs are displayed on the screen at a high refresh rate (here, 120 FPS), the eyes act like a low-pass filter (at 40-50 Hz) and average the embedded data patterns to zero, which yields the same viewing experience as usual.
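
As a rough worked example of this averaging, the sketch below models the eye as a plain average over the two frames of one pair, which is a strong simplification of a real low-pass filter. The luminance values, the data amplitude and the differencing step at the end are illustrative assumptions, not InFrame's actual decoder.

```python
V = [[100, 120], [140, 160]]   # hypothetical luminance values of one video frame
D = [[5, -5], [-5, 5]]         # zero-mean data frame; its complement is -D


def combine(a, b, sign=1):
    """Element-wise a + sign * b for two equally sized frames."""
    return [[x + sign * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


frame_1 = combine(V, D)        # V + D, shown first
frame_2 = combine(V, D, -1)    # V - D, shown immediately after

# Assumed eye model: the two back-to-back frames are averaged.
perceived = [[(x + y) / 2 for x, y in zip(r1, r2)]
             for r1, r2 in zip(frame_1, frame_2)]

# A receiver that captures both frames separately could isolate the data
# by differencing (an illustrative assumption).
recovered = [[(x - y) / 2 for x, y in zip(r1, r2)]
             for r1, r2 in zip(frame_1, frame_2)]

print(perceived)   # [[100.0, 120.0], [140.0, 160.0]]
print(recovered)   # [[5.0, -5.0], [-5.0, 5.0]]
```

The averaged result equals the video frame, while each individual composite frame still differs from it by the data pattern.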

The second technique for hiding data over the primary screen-to-eye channel is called smoothing transitional frames. It tackles a critical challenge that arises when transmitting dynamic data frames. Each STCF pair can embed and hide certain data. However, if the display switches sharply from one pair to the next (e.g., from V1 - D1 to V1 + D2), strong flicker may be noticed due to the low-pass filter property of human eyes. The abrupt switch is equivalent to a quick motion and causes severe visual interference. Smoothing transitional frames address how to switch between consecutive STCF pairs.

We introduce a transition function Ω(t) that gradually changes the amplitude of a data frame over a transition cycle, so that the luminance amplitude moves smoothly from that of Dj to that of Dj+1 between two successive data frames.
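
The text does not give the form of Ω(t), so the sketch below assumes a raised-cosine ramp purely for illustration. The function names, the amplitudes of -4 and 4, and the four-step cycle are hypothetical.

```python
import math


def omega(t):
    """Hypothetical transition function: a raised-cosine ramp that rises
    from 0 at t = 0 to 1 at t = 1 (an assumed shape for Omega(t))."""
    t = min(max(t, 0.0), 1.0)  # clamp t to the transition cycle [0, 1]
    return 0.5 * (1.0 - math.cos(math.pi * t))


def transition_amplitudes(a_from, a_to, steps):
    """Amplitudes used during one transition cycle, moving from the
    amplitude of D_j (a_from) to that of D_j+1 (a_to) in `steps` steps."""
    return [a_from + (a_to - a_from) * omega(k / steps)
            for k in range(steps + 1)]


# Example: a cell whose amplitude flips from -4 to +4 over four steps.
amps = transition_amplitudes(-4.0, 4.0, steps=4)
print([round(a, 3) for a in amps])
```

The amplitude changes slowly at both ends of the cycle and fastest in the middle, so there is no single abrupt jump between the two data frames.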

Boost Data Over Secondary Screen-to-Camera Channel

The challenge that distinguishes InFrame from existing screen-to-camera communication is the interference from the primary video. What is actually available at the receiver is a set of artifacts: a mixture of the original video content and the added data frames, both of which have undergone various screen-to-camera channel distortions such as blurring, geometrical distortion, imprecise positioning, frame rate mismatch and rolling shutter effects. Given the relatively small SNR (signal: the embedded data; noise: the primary video and other sources), we devise a Code Division Multiple Access (CDMA)-like modulation scheme to facilitate accurate and robust demodulation.

Blocks can be formed from complementary cells in different ways and therefore appear as different visual patterns. Each Block has a size of c Cells (c is an even number) and is modulated by a CDMA code. The left table shows the CDMA code sets, where black indicates 1 and white indicates -1. In addition to orthogonality, the CDMA codes have to satisfy some further properties (details in the paper).

Demodulation works by computing the inner product of the received block with every possible code and selecting the code that gives the maximum. To handle inaccuracy in the block location, we also shift the block within a small window around its expected position and take the maximum inner product over all shifts.
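
A minimal sketch of this maximum-inner-product search follows. It flattens a Block to a one-dimensional run of samples and uses the zero-mean rows of a 4x4 Walsh-Hadamard matrix as the code set. Both choices are assumptions made for illustration, as are the names `CODES`, `inner` and `demodulate` and the sample values.

```python
# Assumed code set: the three zero-mean rows of a 4x4 Walsh-Hadamard matrix.
# They are mutually orthogonal; InFrame's real code sets may differ.
CODES = [
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def inner(a, b):
    """Inner product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def demodulate(samples, block_len, nominal_start, window):
    """Try every shift in [-window, window] around nominal_start and every
    code; return (code_index, offset, score) with the largest inner product."""
    best = None
    for offset in range(-window, window + 1):
        start = nominal_start + offset
        if start < 0 or start + block_len > len(samples):
            continue  # shifted block would fall outside the captured samples
        block = samples[start:start + block_len]
        for idx, code in enumerate(CODES):
            score = inner(block, code)
            if best is None or score > best[2]:
                best = (idx, offset, score)
    return best


# Hypothetical capture: code 1 scaled by about 2, with small noise, located
# one sample later than the receiver expects (nominal_start = 1).
samples = [0.1, -0.2, 2.1, 1.9, -2.0, -2.2, 0.0, 0.1]
code_index, offset, score = demodulate(samples, block_len=4,
                                       nominal_start=1, window=2)
print(code_index, offset)  # 1 1
```

In this example the search recovers both the transmitted code and the one-sample positioning error, because the correctly aligned block correlates far more strongly with its own code than any shifted block does with any code.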

In addition, InFrame incorporates techniques for robust block localization and error handling. For errors specific to screen-to-camera communication (e.g., at the border or in particular rows), we propose visual guards and channel reference codes. InFrame also employs error detection and correction coding (e.g., Reed-Solomon (RS) codes and LDPC codes) to further mitigate errors.

We have implemented an InFrame prototype. The real-time encoder runs on the GPU of a PC with a 120 FPS monitor. The decoder runs on Android phones or on a PC connected to a phone. In ideal cases, the throughput reaches up to 360 Kbps with 10% error. In most cases, we achieve 100-200 Kbps.




  • InFrame++: Achieve Simultaneous Screen-Human Viewing and Hidden Screen-Camera Communication
    Anran Wang, Zhuoran Li, Chunyi Peng, Guobin Shen, Gan Fang, and Bing Zeng,
    Proceedings of the 13th International Conference on Mobile Systems, Applications, and Services (MobiSys'15), Florence, Italy, May 2015. (acceptance rate: 29/219 = 13.2%) [Poster]

  • InFrame: Multiplexing Full-Frame Visible Communication Channel for Humans and Devices
    Anran Wang, Chunyi Peng, Ouyang Zhang, Guobin Shen, and Bing Zeng, Proceedings of the 13th ACM Workshop on Hot Topics in Networks (HotNets-XIII), Los Angeles, CA, Nov. 2014. (acceptance rate: 26/118 = 22.0%) [PDF] [Slide] [Demo: InFrame Video]

Team Members

  • Chunyi Peng (Faculty, OSU)
  • Anran Wang (MS student, Beihang University)
  • Zhuoran Li (PhD student, OSU)
  • Gan Fang (Undergraduate student, OSU)
  • Guobin Shen (Researcher, Microsoft Research Asia (MSRA))

Updated on May 18, 2015.